feat(backend): add DingTalk document to knowledge base MCP tool #971
parabala wants to merge 20 commits into wecode-ai:main
Conversation
Add support for importing DingTalk documents into Wegent knowledge bases.
Features:
- New MCP tools for DingTalk document operations:
- get_dingtalk_document_info: Get document metadata from URL
- add_dingtalk_doc_to_knowledge: Add document with text content
- add_dingtalk_doc_with_attachment: Add document using attachment
- DingTalk Docs Service (app/services/dingtalk/):
- Document URL parsing and ID extraction
- Document metadata retrieval
- Filename generation with naming convention: {title}_{timestamp}.md
- DingTalk Docs Skill (init_data/skills/dingtalk-docs/):
- dingtalk_doc_to_kb tool for complete workflow:
1. Start sandbox environment
2. Fetch document info from DingTalk
3. Download document content
4. Save as {title}_{timestamp}.md
5. Upload as attachment
6. Create knowledge base document
The timestamp format follows YYYYMMDDHHMMSS convention.
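The six steps above can be sketched as a small orchestration function. This is a minimal illustration, not the PR's actual code: the injected callables (`fetch_info`, `download`, `upload`, `create_doc`) are hypothetical stand-ins for the sandbox/MCP helpers, and only the `{title}_{timestamp}.md` naming convention is taken from the PR description.

```python
from datetime import datetime


def build_doc_filename(title: str, modified_time: datetime) -> str:
    # {title}_{timestamp}.md with a YYYYMMDDHHMMSS timestamp, per the PR's convention
    return f"{title}_{modified_time.strftime('%Y%m%d%H%M%S')}.md"


def import_doc_to_kb(doc_url, kb_id, *, fetch_info, download, upload, create_doc):
    # Steps 2-6 of the workflow; step 1 (sandbox startup) is assumed done
    info = fetch_info(doc_url)          # 2. fetch document info from DingTalk
    content = download(doc_url)         # 3. download document content
    filename = build_doc_filename(info["title"], info["modified_time"])  # 4. name the file
    attachment_id = upload(filename, content)          # 5. upload as attachment
    return create_doc(kb_id, filename, attachment_id)  # 6. create KB document
```

Each callable can be swapped for the real sandbox or MCP call without changing the control flow.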
Files added:
- app/services/dingtalk/__init__.py
- app/services/dingtalk/docs_service.py
- app/mcp_server/tools/dingtalk_docs.py
- init_data/skills/dingtalk-docs/SKILL.md
- init_data/skills/dingtalk-docs/__init__.py
- init_data/skills/dingtalk-docs/provider.py
Files modified:
- app/mcp_server/server.py (import dingtalk_docs module)
📝 Walkthrough
Adds DingTalk document support: registers new MCP tools, implements a DingTalk docs service that calls MCP tools and external endpoints, adds a sandboxed skill/provider to download/upload/import documents into knowledge bases, and extends the knowledge orchestrator to accept and persist source metadata.
Changes
Sequence Diagram(s)

sequenceDiagram
participant User
participant Tool as DingTalkDocToKBTool
participant Sandbox as Sandbox Env
participant MCP as MCP Server (knowledge)
participant Service as DingTalkDocsService
participant Backend as Backend (attachments/orchestrator)
User->>Tool: run(dingtalk_doc_url, knowledge_base_id)
Tool->>Sandbox: create sandbox / emit running
Sandbox->>MCP: tools/call get_dingtalk_document_info(doc_url)
MCP->>Service: dispatch get_document_info
Service-->>MCP: metadata (doc_id,title,updateTime)
MCP-->>Sandbox: metadata response
Sandbox->>MCP: tools/call download_dingtalk_document(doc_url, format=markdown)
MCP->>Service: dispatch download_document_content
Service-->>MCP: content + metadata
MCP-->>Sandbox: document content
Sandbox->>Backend: upload attachment (multipart)
Backend-->>Sandbox: attachment_id
Sandbox->>MCP: tools/call add_dingtalk_doc_with_attachment(kb_id, attachment_id,...)
MCP->>Backend: create_document_with_content(..., source_config)
Backend-->>MCP: document created (document_id)
MCP-->>Sandbox: success response
Sandbox-->>User: completed result (document_id, attachment_id, filename)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 9
🧹 Nitpick comments (1)
backend/init_data/skills/dingtalk-docs/provider.py (1)
672-685: Centralize filename generation instead of duplicating it here. This sanitizer now exists in both backend/init_data/skills/dingtalk-docs/provider.py and backend/app/services/dingtalk/docs_service.py. Keeping two copies will drift the next time the naming rules change.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 672 - 685, Duplicate filename-sanitization logic exists in _build_filename (class in provider.py) and in docs_service; extract the logic into a single shared function (e.g., sanitize_doc_filename(title: str, modified_time: str) or build_doc_filename(title, modified_time)) in a common module and have _build_filename call that function instead of reimplementing the regex/untitled fallback; update docs_service to import and use the same shared function so both places reference one implementation and remove the duplicated regex/strip/untitled code from _build_filename.
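The shared helper the comment asks for could look like the sketch below. The names `sanitize_filename` and `build_dingtalk_doc_filename` follow the PR's later fix note; the specific regex and the "untitled" fallback are illustrative assumptions, not the repository's exact rules.

```python
import re


def sanitize_filename(title: str) -> str:
    # Collapse characters that are unsafe in filenames into underscores;
    # fall back to "untitled" when nothing usable remains
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return cleaned or "untitled"


def build_dingtalk_doc_filename(title: str, modified_time: str) -> str:
    # modified_time is expected preformatted as YYYYMMDDHHMMSS
    return f"{sanitize_filename(title)}_{modified_time}.md"
```

Both `provider.py` and `docs_service.py` would then import these from one module, so naming rules change in exactly one place.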
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 143-153: The tool currently treats doc_content as required and
returns an error if missing; update add_dingtalk_doc_to_knowledge (or the
function that contains this snippet) to honor the advertised fallback by
attempting to fetch the document from DingTalk when doc_content is falsy: call
the existing DingTalk fetch helper (e.g., fetch_dingtalk_doc,
fetch_doc_content_from_dingtalk, or dingtalk_client.get_document) using the
provided doc_id/metadata, assign the result to doc_content, and only return the
error if that fetch fails or returns empty; ensure the final returned dict
structure remains the same and includes the fetched content on success.
- Around line 66-70: The current get_dingtalk_document_info handler calls
asyncio.run(...) (invoking dingtalk_docs_service.get_document_info) which fails
inside FastMCP's running event loop; change the handler to be async (declare
async def get_dingtalk_document_info(...)) and await
dingtalk_docs_service.get_document_info(doc_url) directly, or alternatively
replace asyncio.run(...) with anyio.to_thread.run_sync(lambda: asyncio.run(...))
/ anyio.to_thread.run_sync(lambda:
dingtalk_docs_service.get_document_info(doc_url)) to run the blocking call off
the loop; update references to get_dingtalk_document_info and ensure its callers
support async invocation.
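The first option the comment describes (making the handler `async` and awaiting the service directly) might look like this sketch. The service class here is a stub standing in for `dingtalk_docs_service`; only the before/after shape of the handler is the point.

```python
import asyncio


class DingTalkDocsService:
    async def get_document_info(self, doc_url: str) -> dict:
        await asyncio.sleep(0)  # stand-in for the real async metadata lookup
        return {"doc_url": doc_url, "title": "stub"}


dingtalk_docs_service = DingTalkDocsService()

# Before (fails inside FastMCP's already-running event loop):
#   def get_dingtalk_document_info(doc_url):
#       return asyncio.run(dingtalk_docs_service.get_document_info(doc_url))


async def get_dingtalk_document_info(doc_url: str) -> dict:
    # Awaiting directly reuses the running loop instead of starting a new one
    return await dingtalk_docs_service.get_document_info(doc_url)
```

`asyncio.run()` raises `RuntimeError` when called from a coroutine already scheduled on a loop, which is why the handler itself must become async.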
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 143-162: get_document_info currently fabricates metadata; replace
the placeholder body in get_document_info(doc_id, doc_url) with a real DingTalk
API lookup that uses the service's credentials/token (e.g. the client or auth
field on this service) to fetch the document metadata (title, last modified
timestamp, content type and canonical URL) and map those into the return shape
(doc_id, title, modified_time ISO, modified_time_formatted as "%Y%m%d%H%M%S",
content_type, url). Ensure you handle API errors (raise or log and propagate
appropriately) and normalize the timestamp from the API into modified_time and
modified_time_formatted; preserve the same keys so downstream code expecting
title/modified_time/etc. continues to work.
- Around line 191-209: download_document_content currently returns placeholder
markdown instead of fetching real document data; replace the stub in
download_document_content to call DingTalk's export/download APIs using the
existing HTTP client or SDK, pass doc_info["doc_id"] and export_format to
request the exported file, download and decode the returned content, set
file_extension based on export_format, populate "content", "title",
"modified_time", "modified_time_formatted", and "doc_id" with the real values
from the response, and add error handling to raise or log failures instead of
returning fake content (refer to download_document_content, doc_info and
export_format to locate the implementation).
In `@backend/init_data/skills/dingtalk-docs/provider.py`:
- Around line 465-499: The _download_document_content() helper currently returns
a hardcoded markdown placeholder instead of the real DingTalk document body;
replace the placeholder logic in _download_document_content (search for function
name _download_document_content and the doc_id extraction block) with a call to
the DingTalk document API using the extracted doc_id (use existing auth/token
helpers in the module or skill to obtain credentials), fetch the actual document
body (convert/format to markdown if needed), handle HTTP errors and empty
responses, and return {"success": True, "content": actual_content} (or
{"success": False, "error": ...} on failures) so imports persist real content
rather than canned text.
- Around line 77-82: The code currently falls back to inserting a
machine-specific path into sys.path to import BaseSandboxTool (the
sys.path.insert(...) branch before importing _base and BaseSandboxTool); remove
that fallback entirely so the module only tries the legitimate import locations
and then raises ImportError if not found. Update the try/except around importing
BaseSandboxTool to eliminate the hard-coded "/workspace/1760905/Wegent/..."
sys.path insertion and ensure the except ImportError simply raises a clear
ImportError mentioning BaseSandboxTool, referencing the import attempt of
BaseSandboxTool from _base.
- Around line 533-546: The current code builds a shell string in curl_cmd and
calls sandbox.commands.run(cmd=curl_cmd), which is vulnerable because
auth_token, file_path and upload_url are interpolated raw; fix it by avoiding a
raw shell string—either construct argv-style input (pass a list/sequence of args
to sandbox.commands.run instead of a single shell string) or reliably escape
each interpolated value using shlex.quote before composing the command; update
the code paths around curl_cmd and the sandbox.commands.run call (and any
helpers like _build_filename if used here) so the values for auth_token,
file_path and upload_url are not interpreted by the shell.
- Around line 605-633: The current approach embeds user-controlled doc_title
into a single-quoted curl command string (curl_cmd) and passes it to
sandbox.commands.run, which allows shell injection if doc_title contains quotes;
instead, avoid constructing a shell command with interpolated JSON—either (a)
perform the HTTP call directly in Python (e.g., use aiohttp/requests) and send
the JSON payload built from payload variable, or (b) write json.dumps(payload)
to a temporary file and invoke sandbox.commands.run with curl using -d
`@/tmp/file` (or pass the JSON via stdin) so no user input is injected into a
shell string; update the code paths that build payload, curl_cmd and the
sandbox.commands.run invocation to use one of these safe approaches and ensure
auth_token and mcp_url are passed as headers/args rather than concatenated into
a shell line.
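The two injection fixes above share one principle: nothing user-controlled may be interpolated raw into a shell string. A sketch of both safe patterns, with hypothetical helper names (`build_upload_cmd`, `build_json_post_cmd` are illustrative, not the PR's API):

```python
import json
import shlex


def build_upload_cmd(auth_token: str, file_path: str, upload_url: str) -> str:
    # Every interpolated value is shlex-quoted, so the shell never interprets it
    return (
        f"curl -s -H {shlex.quote('Authorization: Bearer ' + auth_token)} "
        f"-F {shlex.quote('file=@' + file_path)} {shlex.quote(upload_url)}"
    )


def build_json_post_cmd(payload: dict, payload_path: str, mcp_url: str):
    # Write the JSON body to a file and reference it with @file, so user input
    # (e.g. doc_title inside payload) never appears in the shell line at all
    body = json.dumps(payload)
    cmd = f"curl -s -d @{shlex.quote(payload_path)} {shlex.quote(mcp_url)}"
    return body, cmd  # caller writes `body` to payload_path, then runs `cmd`
```

Where `sandbox.commands.run` accepts argv-style lists, passing `["curl", "-s", ...]` directly is simpler still, since no shell is involved.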
---
Nitpick comments:
In `@backend/init_data/skills/dingtalk-docs/provider.py`:
- Around line 672-685: Duplicate filename-sanitization logic exists in
_build_filename (class in provider.py) and in docs_service; extract the logic
into a single shared function (e.g., sanitize_doc_filename(title: str,
modified_time: str) or build_doc_filename(title, modified_time)) in a common
module and have _build_filename call that function instead of reimplementing the
regex/untitled fallback; update docs_service to import and use the same shared
function so both places reference one implementation and remove the duplicated
regex/strip/untitled code from _build_filename.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b7eb8315-74f6-465f-bbe1-0c282dba752f
📒 Files selected for processing (7)
- backend/app/mcp_server/server.py
- backend/app/mcp_server/tools/dingtalk_docs.py
- backend/app/services/dingtalk/__init__.py
- backend/app/services/dingtalk/docs_service.py
- backend/init_data/skills/dingtalk-docs/SKILL.md
- backend/init_data/skills/dingtalk-docs/__init__.py
- backend/init_data/skills/dingtalk-docs/provider.py
Fixes applied:
1. Fix asyncio.run issue in get_dingtalk_document_info: changed to async def and await directly.
2. Fix add_dingtalk_doc_to_knowledge to fetch content when missing: added logic to fetch the document from DingTalk when doc_content is not provided, falling back to downloading content via dingtalk_docs_service.
3. Fix hardcoded path in provider.py import: removed the hardcoded /workspace path from sys.path.insert and cleaned up import error handling.
4. Fix shell injection vulnerabilities in curl commands: use shlex.quote for command arguments, pass commands as arrays instead of shell strings, and write the JSON payload to a temp file instead of embedding it in the command.
5. Fix duplicate filename sanitization logic: extracted sanitize_filename and build_dingtalk_doc_filename as shared functions; updated docs_service.py and provider.py to import from the shared module.
All Python files pass syntax check and are formatted with black.
Actionable comments posted: 2
♻️ Duplicate comments (4)
backend/app/services/dingtalk/docs_service.py (2)
181-200: ⚠️ Potential issue | 🟠 Major: Placeholder implementation returns synthetic metadata.
get_document_info() fabricates title and modified_time from the URL and local clock instead of calling the DingTalk API. This means downstream filenames and KB records will contain synthetic data rather than actual document metadata. This has been flagged in a previous review. Ensure this is addressed before the feature is production-ready.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/dingtalk/docs_service.py` around lines 181 - 200, The current get_document_info function returns synthetic metadata (title, modified_time, modified_time_formatted) built from doc_id and local clock; replace this placeholder by calling the DingTalk document metadata API to fetch real values (title, lastModified, content type, url) and map them into the returned dict (fields title, modified_time, modified_time_formatted, content_type, url, doc_id); preserve a safe fallback path that logs the API error and only then uses the synthetic values so downstream code (file naming and KB records) receives real metadata when available and predictable fallback data when the API fails.
229-247: ⚠️ Potential issue | 🟠 Major: Placeholder implementation returns dummy content.
download_document_content() returns hardcoded placeholder markdown instead of fetching actual document content from DingTalk. Any KB entry created through this path will contain fake content. This has been flagged in a previous review. Ensure this is addressed before the feature is production-ready.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/dingtalk/docs_service.py` around lines 229 - 247, download_document_content currently returns hardcoded markdown; replace the placeholder logic in download_document_content to call DingTalk's document export/download APIs using the doc_info and export_format, download the exported content (handling possible export conversion for "markdown" vs other formats), parse/validate the response and populate "content", "title", "modified_time", "modified_time_formatted", "file_extension" and "doc_id" from the real API response, and add robust error handling and retries (log and raise on unrecoverable failures) so KB entries contain the actual document content rather than dummy text.
backend/init_data/skills/dingtalk-docs/provider.py (2)
408-452: ⚠️ Potential issue | 🟠 Major: Placeholder: _get_document_info_from_mcp does not call MCP or DingTalk.
This method extracts the document ID from the URL and fabricates a synthetic title/timestamp rather than fetching real metadata. The workflow cannot retrieve actual document information.
This has been flagged in a previous review.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 408 - 452, The _get_document_info_from_mcp function currently only parses doc_id from doc_url and synthesizes metadata; replace the placeholder logic with a real MCP/DingTalk metadata fetch using the provided sandbox (use the sandbox/tool client or the MCP helper available in this environment) to call the appropriate MCP endpoint with the extracted doc_id (or the full doc_url) and return the real fields: success (bool), doc_id, title, modified_time (ISO), modified_time_formatted; preserve the existing error handling pattern (return {"success": False, "error": ...}) and ensure you fall back to an informative error when the MCP call fails or returns no data so callers of _get_document_info_from_mcp get real metadata instead of synthetic values.
454-497: ⚠️ Potential issue | 🟠 Major: Placeholder: _download_document_content returns template content.
This method generates placeholder markdown instead of downloading the actual DingTalk document body. KB entries created through this path will contain dummy text.
This has been flagged in a previous review.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 454 - 497, _download_document_content currently returns placeholder markdown instead of fetching the real document body; replace the template generation with an actual fetch from DingTalk via the MCP/client used elsewhere (use the same client or helper that performs authenticated requests), extract the document ID from doc_url (doc_id) using the existing patterns, call the appropriate MCP/DingTalk API endpoint to retrieve the document content (handle JSON/html responses), convert/clean the response into the expected markdown/plain string, and return {"success": True, "content": content}; ensure robust error handling around the network call (catch and log exceptions and return {"success": False, "error": str(e)}), and reuse existing symbols/functions for HTTP auth/client if present to avoid duplicating credentials.
🧹 Nitpick comments (8)
backend/app/mcp_server/tools/dingtalk_docs.py (2)
239-246: Consider making this handler async for consistency.
add_dingtalk_doc_with_attachment is defined as a synchronous function (def) while the other two tools are async def. If FastMCP handles both patterns correctly, this is fine, but async consistency would be cleaner and would allow using asyncio.to_thread() for the blocking orchestrator call.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 239 - 246, The function add_dingtalk_doc_with_attachment is synchronous while similar handlers are async; change its signature to async def add_dingtalk_doc_with_attachment(...) and update its internals to run the blocking orchestrator call via asyncio.to_thread (or await an async wrapper) so the handler is non-blocking and consistent with the other async tools; ensure any callers or tests that expect the old sync signature are updated to await add_dingtalk_doc_with_attachment and preserve existing behavior for trigger_indexing/trigger_summary and return type Dict[str, Any].
180-186: Move datetime import to module level.
The datetime import is performed twice inside the function at runtime. Since it's already used elsewhere in the codebase, move the import to the top of the file for consistency and a minor performance improvement.
♻️ Proposed fix
Add to module-level imports:

```diff
 import logging
+from datetime import datetime
 from typing import Any, Dict, Optional
```

Then remove the inline imports at lines 180 and 184:

```diff
         logger.warning(
             f"Invalid modified_time format: {modified_time}, using current time"
         )
-        from datetime import datetime
-
         modified_time = datetime.now().strftime("%Y%m%d%H%M%S")
     else:
-        from datetime import datetime
-
         modified_time = datetime.now().strftime("%Y%m%d%H%M%S")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 180 - 186, Move the repeated inline imports of datetime to the module level: add a single module-level line "from datetime import datetime" to the top of the file, then remove the two inline "from datetime import datetime" statements found near the assignments to modified_time (the blocks that set modified_time = datetime.now().strftime("%Y%m%d%H%M%S")). Ensure the existing uses of modified_time still refer to the now-imported datetime and run tests/lint to confirm no unused-import warnings.
backend/app/services/dingtalk/docs_service.py (2)
17-17: Unused import: httpx
The httpx module is imported but never used in the current implementation. This is likely intended for future DingTalk API calls.
♻️ Suggested fix
Either remove the unused import until needed:

```diff
-import httpx
-
 from app.core.config import settings
```

Or add a comment indicating it's reserved for future use:

```python
import httpx  # Reserved for DingTalk API calls
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/dingtalk/docs_service.py` at line 17, The import httpx at module-level in docs_service.py is unused; either remove the import statement entirely or retain it with an explicit comment (e.g., "Reserved for DingTalk API calls") to indicate intentional future use so linters don't flag it — update the module-level import for httpx accordingly.
141-147: Log the parse failure before falling through.
The outer except Exception: pass swallows all errors silently. While the fallback to current time is acceptable, logging the exception would aid debugging parse failures.
♻️ Proposed fix

```diff
             except ValueError:
                 continue
-    except Exception:
-        pass
+    except Exception as e:
+        logger.warning(f"Unexpected error parsing modified_time '{modified_time}': {e}")

     # Return current time as fallback
     logger.warning(
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/dingtalk/docs_service.py` around lines 141 - 147, The outer "except Exception: pass" in the block that tries to parse modified_time silently swallows failures; update that outer handler to log the exception (including exception details) and context (the modified_time value and attempted fmt) before falling back to current time—modify the outer except around the datetime.strptime/strftime loop to call the module/class logger (or existing logger instance) with a descriptive message and the caught exception information so parse failures are visible for debugging.
backend/app/services/dingtalk/__init__.py (1)
14-19: Sort __all__ alphabetically for consistency.
Static analysis flags that __all__ is not sorted. Per isort standards, sorting helps maintainability.
♻️ Proposed fix

```diff
 __all__ = [
+    "build_dingtalk_doc_filename",
     "DingTalkDocsService",
     "dingtalk_docs_service",
     "sanitize_filename",
-    "build_dingtalk_doc_filename",
 ]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/dingtalk/__init__.py` around lines 14 - 19, The __all__ list in this module is not alphabetically sorted; reorder the entries in __all__ so they are lexicographically sorted (e.g., "DingTalkDocsService" before "build_dingtalk_doc_filename", "dingtalk_docs_service", and "sanitize_filename") to satisfy static analysis/isort expectations and maintain consistency.
backend/init_data/skills/dingtalk-docs/provider.py (3)
75-77: Use raise ... from None to clarify exception origin.
Per B904, when re-raising in an except block, use raise ... from None to indicate the exception is intentional and not from mishandled exception processing.
♻️ Proposed fix

```diff
     else:
         raise ImportError(
             "Cannot import BaseSandboxTool from chat_shell.tools.sandbox._base"
-        )
+        ) from None
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 75 - 77, The ImportError being raised for failing to import BaseSandboxTool from chat_shell.tools.sandbox._base should be re-raised using exception chaining suppression; modify the raise in the provider module where BaseSandboxTool is checked so it uses "raise ImportError(... ) from None" to indicate the error is intentional and not caused by a prior exception.
461-462: Remove redundant import re inside method.
Same issue: re is already imported at module level.
♻️ Proposed fix

```diff
     try:
         # Extract doc ID for content generation
-        import re
-
         doc_id = None
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 461 - 462, Remove the redundant local import of the regex module inside the method (the `import re` line shown near the "Extract doc ID for content generation" block) since `re` is already imported at module level; simply delete that inner import so the method uses the module-level `re` symbol (locate the local import inside the provider.py function that extracts the doc ID and remove it).
415-416: Remove redundant import re inside method.
re is already imported at the module level (line 20). The local import is unnecessary.
♻️ Proposed fix

```diff
     try:
         # Extract doc ID from URL
-        import re
-
         doc_id = None
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 415 - 416, Remove the redundant local "import re" inside the method that extracts the doc ID from a URL and rely on the module-level "re" import instead; delete the inline import statement where the comment "Extract doc ID from URL" appears (the duplicate import inside that function) so the function uses the already-imported regex module, then run linters/tests to confirm no unused-import warnings remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 198-207: The call to
knowledge_orchestrator.create_document_with_content is a synchronous, blocking
operation called inside an async handler; wrap this call using asyncio.to_thread
and await it so the blocking DB/file/URL work runs off the event loop (i.e.,
replace the direct call to
knowledge_orchestrator.create_document_with_content(...) with an awaited
asyncio.to_thread invocation of that function, passing the same named arguments:
db, user, knowledge_base_id, name/title, source_type, content/doc_content,
trigger_indexing, trigger_summary).
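The `asyncio.to_thread` wrapping described above might look like this sketch. The orchestrator function here is a stub with simplified parameters; only the await-off-the-loop pattern is the point.

```python
import asyncio


def create_document_with_content(knowledge_base_id: int, name: str, content: str) -> dict:
    # Stand-in for the orchestrator's blocking DB/file work
    return {"document_id": 1, "name": name}


async def add_dingtalk_doc_with_attachment(
    knowledge_base_id: int, name: str, content: str
) -> dict:
    # Run the blocking call in a worker thread so the event loop stays responsive
    return await asyncio.to_thread(
        create_document_with_content, knowledge_base_id, name, content
    )
```

`asyncio.to_thread` (Python 3.9+) accepts the callable plus its arguments, so the orchestrator itself needs no changes.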
In `@backend/init_data/skills/dingtalk-docs/provider.py`:
- Around line 624-627: The hardcoded payload_file ("/tmp/mcp_payload.json") can
cause race conditions; change the code that builds/writes payload_file (where
payload_json is created and sandbox.files.write is called) to create a secure
unique temp path using Python's tempfile (e.g., tempfile.mkstemp() or
tempfile.NamedTemporaryFile) or append a uuid, write payload_json to that unique
filename, and ensure you close/remove the temp file after use; update references
to payload_file accordingly so sandbox.files.write uses the generated unique
path.
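Since the payload file is written through `sandbox.files.write` rather than the local filesystem, the uuid-suffix variant of the fix is the simpler fit; a sketch (the helper name is illustrative):

```python
import uuid


def unique_payload_path(prefix: str = "/tmp/mcp_payload") -> str:
    # A per-call random suffix prevents two concurrent runs from
    # clobbering each other's payload file at a fixed path
    return f"{prefix}_{uuid.uuid4().hex}.json"
```

The caller would pass the generated path to `sandbox.files.write`, reference it in the curl `-d @file` argument, and remove it after the request completes.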
---
Duplicate comments:
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 181-200: The current get_document_info function returns synthetic
metadata (title, modified_time, modified_time_formatted) built from doc_id and
local clock; replace this placeholder by calling the DingTalk document metadata
API to fetch real values (title, lastModified, content type, url) and map them
into the returned dict (fields title, modified_time, modified_time_formatted,
content_type, url, doc_id); preserve a safe fallback path that logs the API
error and only then uses the synthetic values so downstream code (file naming
and KB records) receives real metadata when available and predictable fallback
data when the API fails.
- Around line 229-247: download_document_content currently returns hardcoded
markdown; replace the placeholder logic in download_document_content to call
DingTalk's document export/download APIs using the doc_info and export_format,
download the exported content (handling possible export conversion for
"markdown" vs other formats), parse/validate the response and populate
"content", "title", "modified_time", "modified_time_formatted", "file_extension"
and "doc_id" from the real API response, and add robust error handling and
retries (log and raise on unrecoverable failures) so KB entries contain the
actual document content rather than dummy text.
In `@backend/init_data/skills/dingtalk-docs/provider.py`:
- Around line 408-452: The _get_document_info_from_mcp function currently only
parses doc_id from doc_url and synthesizes metadata; replace the placeholder
logic with a real MCP/DingTalk metadata fetch using the provided sandbox (use
the sandbox/tool client or the MCP helper available in this environment) to call
the appropriate MCP endpoint with the extracted doc_id (or the full doc_url) and
return the real fields: success (bool), doc_id, title, modified_time (ISO),
modified_time_formatted; preserve the existing error handling pattern (return
{"success": False, "error": ...}) and ensure you fall back to an informative
error when the MCP call fails or returns no data so callers of
_get_document_info_from_mcp get real metadata instead of synthetic values.
- Around line 454-497: _download_document_content currently returns placeholder
markdown instead of fetching the real document body; replace the template
generation with an actual fetch from DingTalk via the MCP/client used elsewhere
(use the same client or helper that performs authenticated requests), extract
the document ID from doc_url (doc_id) using the existing patterns, call the
appropriate MCP/DingTalk API endpoint to retrieve the document content (handle
JSON/html responses), convert/clean the response into the expected
markdown/plain string, and return {"success": True, "content": content}; ensure
robust error handling around the network call (catch and log exceptions and
return {"success": False, "error": str(e)}), and reuse existing
symbols/functions for HTTP auth/client if present to avoid duplicating
credentials.
---
Nitpick comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 239-246: The function add_dingtalk_doc_with_attachment is
synchronous while similar handlers are async; change its signature to async def
add_dingtalk_doc_with_attachment(...) and update its internals to run the
blocking orchestrator call via asyncio.to_thread (or await an async wrapper) so
the handler is non-blocking and consistent with the other async tools; ensure
any callers or tests that expect the old sync signature are updated to await
add_dingtalk_doc_with_attachment and preserve existing behavior for
trigger_indexing/trigger_summary and return type Dict[str, Any].
- Around line 180-186: Move the repeated inline imports of datetime to the
module level: add a single module-level line "from datetime import datetime" to
the top of the file, then remove the two inline "from datetime import datetime"
statements found near the assignments to modified_time (the blocks that set
modified_time = datetime.now().strftime("%Y%m%d%H%M%S")). Ensure the existing
uses of modified_time still refer to the now-imported datetime and run
tests/lint to confirm no unused-import warnings.
In `@backend/app/services/dingtalk/__init__.py`:
- Around line 14-19: The __all__ list in this module is not alphabetically
sorted; reorder the entries in __all__ so they are lexicographically sorted
(e.g., "DingTalkDocsService" before "build_dingtalk_doc_filename",
"dingtalk_docs_service", and "sanitize_filename") to satisfy static
analysis/isort expectations and maintain consistency.
In `@backend/app/services/dingtalk/docs_service.py`:
- Line 17: The import httpx at module-level in docs_service.py is unused; either
remove the import statement entirely or retain it with an explicit comment
(e.g., "Reserved for DingTalk API calls") to indicate intentional future use so
linters don't flag it — update the module-level import for httpx accordingly.
- Around line 141-147: The outer "except Exception: pass" in the block that
tries to parse modified_time silently swallows failures; update that outer
handler to log the exception (including exception details) and context (the
modified_time value and attempted fmt) before falling back to current
time—modify the outer except around the datetime.strptime/strftime loop to call
the module/class logger (or existing logger instance) with a descriptive message
and the caught exception information so parse failures are visible for
debugging.
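The logged-fallback pattern described above can be sketched as follows; `format_modified_time` and the format list are illustrative stand-ins for the service's actual parsing helper, not its real signature:

```python
import logging
from datetime import datetime

logger = logging.getLogger(__name__)


def format_modified_time(modified_time: str) -> str:
    """Normalize a timestamp string to YYYYMMDDHHMMSS, logging parse failures."""
    for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(modified_time, fmt).strftime("%Y%m%d%H%M%S")
        except ValueError:
            continue
    # Log instead of silently swallowing, so parse failures stay visible.
    logger.warning(
        "Could not parse modified_time %r; falling back to now()", modified_time
    )
    return datetime.now().strftime("%Y%m%d%H%M%S")
```

The key change is that the fallback path emits a warning with the offending value instead of a bare `except Exception: pass`.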
In `@backend/init_data/skills/dingtalk-docs/provider.py`:
- Around line 75-77: The ImportError being raised for failing to import
BaseSandboxTool from chat_shell.tools.sandbox._base should be re-raised using
exception chaining suppression; modify the raise in the provider module where
BaseSandboxTool is checked so it uses "raise ImportError(... ) from None" to
indicate the error is intentional and not caused by a prior exception.
- Around line 461-462: Remove the redundant local import of the regex module
inside the method (the `import re` line shown near the "Extract doc ID for
content generation" block) since `re` is already imported at module level;
simply delete that inner import so the method uses the module-level `re` symbol
(locate the local import inside the provider.py function that extracts the doc
ID and remove it).
- Around line 415-416: Remove the redundant local "import re" inside the method
that extracts the doc ID from a URL and rely on the module-level "re" import
instead; delete the inline import statement where the comment "Extract doc ID
from URL" appears (the duplicate import inside that function) so the function
uses the already-imported regex module, then run linters/tests to confirm no
unused-import warnings remain.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 815d6cfc-0722-43c9-a13e-c10817cb7e42
📒 Files selected for processing (4)
backend/app/mcp_server/tools/dingtalk_docs.py
backend/app/services/dingtalk/__init__.py
backend/app/services/dingtalk/docs_service.py
backend/init_data/skills/dingtalk-docs/provider.py
Critical fixes:
1. Remove duplicate _get_user_from_token - import from knowledge.py instead of reimplementing
2. Implement real DingTalk API calls via MCP
   - get_document_info now calls the dingtalk_docs MCP 'get_document_info' tool
   - download_document_content now calls the dingtalk_docs MCP 'download_document' tool
   - Added _call_dingtalk_mcp_tool helper for MCP communication
   - Added _get_dingtalk_docs_mcp_config to get user MCP settings
3. User preferences integration
   - MCP tools now pass user.preferences to the service layer
   - Service reads the dingtalk_docs MCP config from user preferences
   - Falls back to a placeholder if MCP is not configured

The implementation now requires users to:
1. Configure the dingtalk_docs MCP in user settings
2. Provide the MCP server URL from DingTalk

Files modified:
- app/mcp_server/tools/dingtalk_docs.py
- app/services/dingtalk/docs_service.py
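There is no `add_after` entry for this line.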
Actionable comments posted: 1
♻️ Duplicate comments (3)
backend/app/mcp_server/tools/dingtalk_docs.py (1)
140-143: ⚠️ Potential issue | 🟠 Major — Keep the synchronous document-creation phase off the async handler.
`add_dingtalk_doc_to_knowledge()` is `async`, but it creates a synchronous `SessionLocal()` and then calls `knowledge_orchestrator.create_document_with_content()` directly. That blocks the FastMCP event loop, and the obvious `asyncio.to_thread(..., db=db)` fix would also be unsafe because the current SQLAlchemy session is not thread-safe. Move the whole DB/orchestrator phase behind a worker-thread helper that opens and closes its own `SessionLocal()`.
Also applies to: 205-214
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 140-143: the async handler add_dingtalk_doc_to_knowledge creates a synchronous SessionLocal() and calls knowledge_orchestrator.create_document_with_content directly, which blocks the event loop. Refactor by extracting the DB/orchestrator work (including opening/closing a fresh SessionLocal, calling _get_user_from_token, and calling knowledge_orchestrator.create_document_with_content) into a synchronous helper function (e.g., sync_create_dingtalk_doc) that creates its own SessionLocal, performs the user lookup and document creation, closes the session, and returns results; then call that helper from the async function via asyncio.to_thread(sync_create_dingtalk_doc, ...), ensuring no existing SQLAlchemy session is shared across threads. Ensure the same change is applied to the other occurrence around lines 205-214.
backend/app/services/dingtalk/docs_service.py (2)
261-309: ⚠️ Potential issue | 🔴 Critical — Do not fabricate metadata on MCP miss or failure.
The missing-config path, the catch-all error path, and even the Line 286/Line 287 defaults all manufacture a title/timestamp and return them as if DingTalk had confirmed them. That makes `get_document_info()` report success with fake metadata and reintroduces the earlier bad-filename/bad-record failure mode.
🧭 Patch sketch
```diff
     mcp_config = self._get_dingtalk_docs_mcp_config(user_preferences)
     if not mcp_config:
-        logger.warning(
-            "dingtalk_docs MCP not configured, returning placeholder info"
-        )
-        # Fallback to placeholder if MCP not configured
-        now = datetime.now()
-        return {
-            "doc_id": doc_id,
-            "title": f"DingTalkDoc_{doc_id[:8]}",
-            "modified_time": now.isoformat(),
-            "modified_time_formatted": now.strftime("%Y%m%d%H%M%S"),
-            "content_type": "markdown",
-            "url": doc_url,
-        }
+        raise ValueError(
+            "dingtalk_docs MCP not configured. "
+            "Please configure DingTalk Docs MCP in user settings."
+        )
@@
-    title = result.get("title", f"DingTalkDoc_{doc_id[:8]}")
-    modified_time = result.get("modified_time", datetime.now().isoformat())
+    title = result.get("title")
+    modified_time = result.get("modified_time")
+    if not title or not modified_time:
+        raise ValueError(
+            "MCP response missing required document metadata"
+        )
@@
-    except Exception as e:
-        logger.error(f"Failed to get document info from MCP: {e}")
-        # Fallback to placeholder on error
-        now = datetime.now()
-        return {
-            "doc_id": doc_id,
-            "title": f"DingTalkDoc_{doc_id[:8]}",
-            "modified_time": now.isoformat(),
-            "modified_time_formatted": now.strftime("%Y%m%d%H%M%S"),
-            "content_type": "markdown",
-            "url": doc_url,
-        }
+    except Exception as e:
+        logger.error(f"Failed to get document info from MCP: {e}")
+        raise ValueError(f"Failed to get document info: {e}") from e
```
95-113: ⚠️ Potential issue | 🟠 Major — Validate the hostname before extracting a document ID.
`re.search()` scans the full input string, so an unrelated URL or arbitrary text that merely embeds `alidocs.dingtalk.com/i/...` in a query or fragment will pass validation. Because the original `doc_url` is then forwarded unchanged to the downstream MCP call, this boundary currently accepts non-DingTalk URLs instead of rejecting them up front.
🔒 Patch sketch
```diff
+from urllib.parse import urlparse
+
 def _extract_doc_id_from_url(self, url: str) -> Optional[str]:
     if not url:
         return None
+    parsed = urlparse(url)
+    if parsed.scheme not in {"http", "https"}:
+        return None
+    if parsed.hostname != "alidocs.dingtalk.com":
+        return None
+    path = parsed.path.rstrip("/")
-    node_pattern = r"alidocs\.dingtalk\.com/i/nodes/([a-zA-Z0-9_-]+)"
-    match = re.search(node_pattern, url)
+    match = re.fullmatch(r"/i/nodes/([a-zA-Z0-9_-]+)", path)
     if match:
         return match.group(1)
-    docs_pattern = r"alidocs\.dingtalk\.com/i/team/[^/]+/docs/([a-zA-Z0-9_-]+)"
-    match = re.search(docs_pattern, url)
+    match = re.fullmatch(r"/i/team/[^/]+/docs/([a-zA-Z0-9_-]+)", path)
     if match:
         return match.group(1)
-    wiki_pattern = r"alidocs\.dingtalk\.com/i/team/[^/]+/wiki/([a-zA-Z0-9_-]+)"
-    match = re.search(wiki_pattern, url)
+    match = re.fullmatch(r"/i/team/[^/]+/wiki/([a-zA-Z0-9_-]+)", path)
     if match:
         return match.group(1)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/dingtalk/docs_service.py` around lines 95 - 113, The extraction currently uses re.search over the whole url string (node_pattern, docs_pattern, wiki_pattern) which can match embedded text in unrelated URLs; instead parse the url with urllib.parse.urlparse, validate that the parsed hostname (netloc) is exactly or endswith "alidocs.dingtalk.com" (depending on allowed subdomains), and then run your regexes against only the parsed path (or use path-based pattern matching) to extract and return the document id; update the code that references url, node_pattern, docs_pattern, and wiki_pattern to perform the hostname check before returning any match and return None if the hostname is not the expected DingTalk host.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 195-196: The code builds a timestamped filename via
dingtalk_docs_service.build_filename(title, modified_time) and logs/returns it
but the document creation still calls the create function with name=title and
omits file_extension, so the timestamped "{title}_{timestamp}.md" is never
applied; update the document creation call that currently uses name=title to use
name=filename and set file_extension to ".md" (or the expected extension) so the
built filename is actually used. Locate the variable filename, the
build_filename call, and the create(...) invocation that currently passes
name=title and modify that invocation to pass name=filename and include
file_extension=".md".
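As a rough sketch of the `{title}_{timestamp}.md` convention the comment asks to apply, here is a hypothetical `build_filename` helper; the real `dingtalk_docs_service.build_filename` signature and sanitization rules may differ:

```python
import re
from datetime import datetime
from typing import Optional


def build_filename(title: str, modified_time: Optional[datetime] = None) -> str:
    """Build a knowledge-base filename following {title}_{timestamp}.md."""
    # Replace characters that are unsafe in filenames (illustrative set).
    safe_title = re.sub(r'[\\/:*?"<>|]+', "_", title).strip() or "untitled"
    # Timestamp format follows the YYYYMMDDHHMMSS convention from the PR.
    ts = (modified_time or datetime.now()).strftime("%Y%m%d%H%M%S")
    return f"{safe_title}_{ts}.md"
```

The point of the review finding is that this built filename, not the bare `title`, must be passed to the create call.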
---
Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 140-143: The async handler add_dingtalk_doc_to_knowledge is
creating a synchronous SessionLocal() and calling
knowledge_orchestrator.create_document_with_content directly which blocks the
event loop; refactor by extracting the DB/orchestrator work (including
opening/closing a fresh SessionLocal and calling _get_user_from_token and
knowledge_orchestrator.create_document_with_content) into a synchronous helper
function (e.g., sync_create_dingtalk_doc(db_work_helper) that creates its own
SessionLocal, performs the user lookup and document creation, closes the
session, and returns results) and then call that helper from the async function
via asyncio.to_thread(sync_create_dingtalk_doc, ...), ensuring no existing
SQLAlchemy SessionLocal instance is shared across threads. Ensure the same
change is applied to the other occurrence around lines 205-214.
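The worker-thread pattern the prompt describes can be sketched like this; `FakeSession`, `SessionLocal`, and `create_document_with_content` are stand-ins for the app's real session factory and orchestrator:

```python
import asyncio


class FakeSession:
    """Stand-in for a SQLAlchemy session (hypothetical)."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


def SessionLocal() -> FakeSession:
    # Hypothetical factory; the real one comes from the app's DB setup.
    return FakeSession()


def create_document_with_content(db: FakeSession, title: str) -> dict:
    # Placeholder for knowledge_orchestrator.create_document_with_content.
    return {"title": title, "status": "created"}


def _sync_create_dingtalk_doc(title: str) -> dict:
    """Runs in a worker thread with its own session; never share one across threads."""
    db = SessionLocal()
    try:
        return create_document_with_content(db, title)
    finally:
        db.close()


async def add_dingtalk_doc_to_knowledge(title: str) -> dict:
    # The blocking DB/orchestrator phase runs off the event loop.
    return await asyncio.to_thread(_sync_create_dingtalk_doc, title)


print(asyncio.run(add_dingtalk_doc_to_knowledge("Design Notes")))
```

Opening the session inside the thread avoids the thread-safety problem of passing the handler's session into `asyncio.to_thread`.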
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 95-113: The extraction currently uses re.search over the whole url
string (node_pattern, docs_pattern, wiki_pattern) which can match embedded text
in unrelated URLs; instead parse the url with urllib.parse.urlparse, validate
that the parsed hostname (netloc) is exactly or endswith "alidocs.dingtalk.com"
(depending on allowed subdomains), and then run your regexes against only the
parsed path (or use path-based pattern matching) to extract and return the
document id; update the code that references url, node_pattern, docs_pattern,
and wiki_pattern to perform the hostname check before returning any match and
return None if the hostname is not the expected DingTalk host.
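A minimal, runnable sketch of the hostname-validating extractor described above; the path patterns mirror the ones quoted in the review, but the method name and placement are illustrative:

```python
import re
from typing import Optional
from urllib.parse import urlparse

_PATH_PATTERNS = (
    r"/i/nodes/([a-zA-Z0-9_-]+)",
    r"/i/team/[^/]+/docs/([a-zA-Z0-9_-]+)",
    r"/i/team/[^/]+/wiki/([a-zA-Z0-9_-]+)",
)


def extract_doc_id(url: str) -> Optional[str]:
    """Accept only real alidocs.dingtalk.com URLs, then match the path."""
    parsed = urlparse(url)
    if parsed.scheme not in {"http", "https"}:
        return None
    if parsed.hostname != "alidocs.dingtalk.com":
        return None
    path = parsed.path.rstrip("/")
    for pattern in _PATH_PATTERNS:
        match = re.fullmatch(pattern, path)
        if match:
            return match.group(1)
    return None
```

Because the regex now runs only against the parsed path, a hostile URL that merely embeds the DingTalk host in its query string no longer passes validation.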
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 950c5d32-7fca-4851-9f74-8d16eb86ad0b
📒 Files selected for processing (2)
backend/app/mcp_server/tools/dingtalk_docs.py
backend/app/services/dingtalk/docs_service.py
Critical fixes:
1. Fix _get_document_info_from_mcp() to call real MCP
- Now calls backend MCP endpoint 'get_dingtalk_document_info'
- Returns real document metadata from DingTalk
- Added proper error handling and temp file cleanup
2. Fix _download_document_content_real() to fetch real content
- New method that calls backend MCP endpoint 'download_document'
- Returns actual document content from DingTalk
- Uses unique temp file names with uuid
- Added proper cleanup
3. Fix blocking call in dingtalk_docs.py
- Wrapped knowledge_orchestrator.create_document_with_content
- Now uses asyncio.to_thread() to run off event loop
4. Fix hardcoded temp file path
- Changed from '/tmp/mcp_payload.json' to '/tmp/mcp_payload_{uuid}.json'
- Added try/finally blocks to ensure cleanup
- Applied to all three places using temp files
Files modified:
- app/mcp_server/tools/dingtalk_docs.py
- init_data/skills/dingtalk-docs/provider.py
🧹 Nitpick comments (3)
backend/init_data/skills/dingtalk-docs/provider.py (3)
861-905: Consider simplifying the redundant cleanup logic. The cleanup code at lines 886-893 runs inside the main `try` block (after `sandbox.commands.run`), but the same cleanup also exists in the `except` block (lines 897-904). This works but is redundant since a single `finally` block would handle both paths.
♻️ Simplified structure
```diff
 try:
     # Build curl command using array style to prevent shell injection
     import shlex

     curl_cmd = [
         "curl", "-s", "-X", "POST",
         "-H", f"Authorization: Bearer {auth_token}",
         "-H", "Content-Type: application/json",
         "-d", f"@{payload_file}",
         mcp_url,
     ]

     result_obj = await sandbox.commands.run(
         cmd=shlex.join(curl_cmd),
         cwd="/home/user",
         timeout=60,
     )
-
-    # Clean up temp file after use
-    try:
-        await sandbox.commands.run(
-            cmd=f"rm -f {payload_file}",
-            cwd="/home/user",
-            timeout=10,
-        )
-    except Exception:
-        pass
-
-except Exception as e:
-    # Clean up temp file on error
+finally:
     try:
         await sandbox.commands.run(
             cmd=f"rm -f {payload_file}",
             cwd="/home/user",
             timeout=10,
         )
     except Exception:
         pass
-    raise e
```
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 861 - 905, The temp-file cleanup for payload_file is duplicated in both the success path and exception path; refactor the try/except to a try/finally around the sandbox.commands.run invocation that executes the rm -f {payload_file} cleanup (called via sandbox.commands.run) in the finally block, remove the duplicate cleanup in the except branch, and keep error re-raising behavior for exceptions from sandbox.commands.run; use the existing curl_cmd/shlex.join, auth_token and mcp_url variables as before to locate the code to change.
74-77: Use `raise ... from None` to distinguish from exception handling errors. When re-raising an `ImportError` within an `except` clause, use `from None` to suppress the exception chain and clarify intent.
♻️ Proposed fix
```diff
 else:
-    raise ImportError(
+    raise ImportError(
         "Cannot import BaseSandboxTool from chat_shell.tools.sandbox._base"
-    )
+    ) from None
```
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 74 - 77, The ImportError raised in the else branch that references "Cannot import BaseSandboxTool from chat_shell.tools.sandbox._base" should be suppressed from chaining by using "raise ... from None"; locate the else block in provider.py where the ImportError is raised (the message about BaseSandboxTool and chat_shell.tools.sandbox._base) and modify the raise to use "from None" so the exception chain is not preserved.
548-592: Remove dead placeholder code. This `_download_document_content` method returns placeholder content and is never called—the workflow at line 303 uses `_download_document_content_real` instead. Remove this dead code to avoid confusion.
🗑️ Remove dead code
```diff
-    async def _download_document_content(self, sandbox: Any, doc_url: str) -> dict:
-        """Download document content from DingTalk.
-
-        For now, this returns placeholder content.
-        In production, this would call the DingTalk API via MCP.
-        """
-        try:
-            # Extract doc ID for content generation
-            import re
-
-            doc_id = None
-            patterns = [
-                r"alidocs\.dingtalk\.com/i/nodes/([a-zA-Z0-9_-]+)",
-                r"alidocs\.dingtalk\.com/i/team/[^/]+/docs/([a-zA-Z0-9_-]+)",
-            ]
-
-            for pattern in patterns:
-                match = re.search(pattern, doc_url)
-                if match:
-                    doc_id = match.group(1)
-                    break
-
-            # Generate placeholder content
-            content = f"""# DingTalk Document
-
-This document was imported from DingTalk.
-
-**Source URL:** {doc_url}
-**Document ID:** {doc_id or "unknown"}
-**Imported at:** {datetime.now().isoformat()}
-
-## Content
-
-The actual content would be fetched from DingTalk API in production.
-This is a placeholder for the document content.
-
----
-*Imported by Wegent DingTalk Docs Skill*
-"""
-
-            return {"success": True, "content": content}
-
-        except Exception as e:
-            return {"success": False, "error": str(e)}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-docs/provider.py` around lines 548 - 592, Remove the dead placeholder method _download_document_content (its docstring, body and exception handling) from the DingTalk docs provider since the workflow uses _download_document_content_real; also remove any now-unused imports referenced only by that method (e.g., re, datetime) and ensure there are no remaining references to _download_document_content elsewhere in the module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/init_data/skills/dingtalk-docs/provider.py`:
- Around line 861-905: The temp-file cleanup for payload_file is duplicated in
both the success path and exception path; refactor the try/except to a
try/finally around the sandbox.commands.run invocation that executes the rm -f
{payload_file} cleanup (called via sandbox.commands.run) in the finally block,
remove the duplicate cleanup in the except branch, and keep error re-raising
behavior for exceptions from sandbox.commands.run; use the existing
curl_cmd/shlex.join, auth_token and mcp_url variables as before to locate the
code to change.
- Around line 74-77: The ImportError raised in the else branch that references
"Cannot import BaseSandboxTool from chat_shell.tools.sandbox._base" should be
suppressed from chaining by using "raise ... from None"; locate the else block
in provider.py where the ImportError is raised (the message about
BaseSandboxTool and chat_shell.tools.sandbox._base) and modify the raise to use
"from None" so the exception chain is not preserved.
- Around line 548-592: Remove the dead placeholder method
_download_document_content (its docstring, body and exception handling) from the
DingTalk docs provider since the workflow uses _download_document_content_real;
also remove any now-unused imports referenced only by that method (e.g., re,
datetime) and ensure there are no remaining references to
_download_document_content elsewhere in the module.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 027c1883-d4aa-4e6e-87e0-6a556953a71b
📒 Files selected for processing (2)
backend/app/mcp_server/tools/dingtalk_docs.py
backend/init_data/skills/dingtalk-docs/provider.py
Add better error handling and user-friendly messages when DingTalk document authentication fails:
1. Detect HTTP 401/403 errors from DingTalk MCP
2. Check for authentication-related keywords in error messages
3. Provide clear instructions to users:
   - Ensure they have access to the document in DingTalk
   - Check if the document is shared or public
   - Verify the MCP configuration is correct

This helps users understand why a document download failed and what steps they need to take to fix it.

File modified:
- app/services/dingtalk/docs_service.py
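A minimal sketch of the auth-failure detection this commit describes; the keyword list, helper name, and message text are assumptions based on the commit description, not the actual implementation:

```python
from typing import Optional

AUTH_ERROR_CODES = (401, 403)
AUTH_KEYWORDS = ("unauthorized", "forbidden", "access denied")


def friendly_auth_error(status_code: int, body: str) -> Optional[str]:
    """Return user-facing guidance for DingTalk auth failures, else None."""
    if status_code in AUTH_ERROR_CODES or any(
        k in body.lower() for k in AUTH_KEYWORDS
    ):
        return (
            "DingTalk authentication required. Please ensure:\n"
            "1. You have access to this document in DingTalk\n"
            "2. The document is shared with you or is public\n"
            "3. Your DingTalk MCP configuration is correct"
        )
    return None
```

Checking both the status code and the response body covers gateways that return 200 with an error payload.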
Actionable comments posted: 3
🧹 Nitpick comments (2)
backend/app/services/dingtalk/docs_service.py (2)
191-225: Extract repeated numeric literals into named constants. The literals at Line 209 (`60.0`), Line 219/413 (`401`, `403`), and Line 392 (`doc_id[:8]`) should be constants for maintainability.
As per coding guidelines, "Extract magic numbers to named constants in Python code."
Also applies to: 392-405
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/dingtalk/docs_service.py` around lines 191 - 225, Replace magic numeric literals with descriptive constants: define DEFAULT_HTTP_TIMEOUT = 60.0 and use it when creating httpx.AsyncClient(timeout=DEFAULT_HTTP_TIMEOUT); define AUTH_ERROR_CODES = (401, 403) and use it in the HTTPStatusError handling instead of hard-coded 401/403; define DOC_ID_PREVIEW_LEN = 8 and use it wherever doc_id[:8] is used. Update the references in the MCP call function (the payload/call block and the except httpx.HTTPStatusError handler) and in the code that slices doc_id so all occurrences use these named constants (e.g., in the function that constructs doc previews or logs). Ensure constants are placed near the module top with descriptive names and used throughout.
74-76: Drop the no-op constructor or add an explicit return type annotation. Lines 74-76 define an empty `__init__` without `-> None`; either remove it or annotate it explicitly.
As per coding guidelines, "Python code MUST follow PEP 8, Black formatter (line length: 88), and isort standards with type hints required."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/dingtalk/docs_service.py` around lines 74 - 76, The empty constructor def __init__(self): in docs_service.py is a no-op and missing a return type; either remove this method entirely if it adds no behavior, or annotate it with an explicit -> None return type (i.e., change to def __init__(self) -> None:) to satisfy type-hinting rules and PEP8/Black requirements; update the __init__ declaration accordingly in the class that defines it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 289-337: The code currently masks MCP/config/auth failures by
returning fabricated metadata in get_document_info flow; instead, when
_get_dingtalk_docs_mcp_config(user_preferences) returns falsy, raise a clear
exception (e.g., RuntimeError or a DingtalkConfigError) with context about
missing MCP config rather than returning a placeholder, and in the except block
around _call_dingtalk_mcp_tool re-raise the original exception (or wrap it with
additional context) instead of returning placeholder metadata; update references
in this function to surface errors from _get_dingtalk_docs_mcp_config and
_call_dingtalk_mcp_tool so callers can handle auth/config failures (include the
original exception message when re-raising).
- Around line 343-405: The export_format parameter is not validated before
calling the MCP and is later used directly as file_extension; update the method
handling doc download to validate export_format against allowed values (e.g.,
"markdown", "html", "txt") before calling _call_dingtalk_mcp_tool, map
"markdown" to "md" for file_extension, and raise a clear ValueError for
unsupported formats; locate this logic in the same function (the one that calls
_call_dingtalk_mcp_tool and uses _format_modified_time) and perform
validation/mapping right after extracting doc_id and before building the MCP
arguments.
- Around line 414-422: The two ValueError raises that handle
httpx.HTTPStatusError should chain the original exception (use "raise
ValueError(...) from e") and the multi-line messages that use f-strings without
interpolation should be plain string literals; update the raises referencing the
caught exception variable "e" so both raise ValueError(...) from e and replace
f"... " with regular "..." for the static messages (the ValueError lines that
mention DingTalk authentication and the one that includes HTTP
{e.response.status_code}).
---
Nitpick comments:
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 191-225: Replace magic numeric literals with descriptive
constants: define DEFAULT_HTTP_TIMEOUT = 60.0 and use it when creating
httpx.AsyncClient(timeout=DEFAULT_HTTP_TIMEOUT); define AUTH_ERROR_CODES = (401,
403) and use it in the HTTPStatusError handling instead of hard-coded 401/403;
define DOC_ID_PREVIEW_LEN = 8 and use it wherever doc_id[:8] is used. Update the
references in the MCP call function (the payload/call block and the except
httpx.HTTPStatusError handler) and in the code that slices doc_id so all
occurrences use these named constants (e.g., in the function that constructs doc
previews or logs). Ensure constants are placed near the module top with
descriptive names and used throughout.
- Around line 74-76: The empty constructor def __init__(self): in
docs_service.py is a no-op and missing a return type; either remove this method
entirely if it adds no behavior, or annotate it with an explicit -> None return
type (i.e., change to def __init__(self) -> None:) to satisfy type-hinting rules
and PEP8/Black requirements; update the __init__ declaration accordingly in the
class that defines it.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: dcaed037-be49-4cbc-ba8b-ecee32782f09
📒 Files selected for processing (1)
backend/app/services/dingtalk/docs_service.py
```python
    export_format: str = "markdown",
) -> Dict[str, Any]:
    """Download DingTalk document content via MCP.

    Args:
        doc_url: DingTalk document URL
        user_preferences: User's preferences JSON string (for MCP config)
        export_format: Export format (markdown, html, txt)

    Returns:
        Dict containing:
        - content: Document content as string
        - title: Document title
        - modified_time: Modification time
        - file_extension: Suggested file extension

    Raises:
        ValueError: If download fails
    """
    doc_id = self._extract_doc_id_from_url(doc_url)
    if not doc_id:
        raise ValueError(f"Invalid DingTalk document URL: {doc_url}")

    logger.info(f"Downloading document content for doc_id: {doc_id}")

    # Get MCP config
    mcp_config = self._get_dingtalk_docs_mcp_config(user_preferences)
    if not mcp_config:
        raise ValueError(
            "dingtalk_docs MCP not configured. "
            "Please configure DingTalk Docs MCP in user settings."
        )

    # Call MCP tool to download document content
    try:
        result = await self._call_dingtalk_mcp_tool(
            mcp_config,
            tool_name="download_document",
            arguments={
                "doc_id": doc_id,
                "url": doc_url,
                "format": export_format,
            },
        )

        content = result.get("content", "")
        if not content:
            raise ValueError("Empty content returned from MCP")

        title = result.get("title", f"DingTalkDoc_{doc_id[:8]}")
        modified_time = result.get(
            "modified_time",
            result.get("last_modified", datetime.now().isoformat()),
        )

        return {
            "content": content,
            "title": title,
            "modified_time": modified_time,
            "modified_time_formatted": self._format_modified_time(modified_time),
            "file_extension": (
                "md" if export_format == "markdown" else export_format
            ),
```
Validate export_format against supported values before MCP call.
Line 343 accepts any string; unsupported values are forwarded downstream and also used as file_extension on Line 403-405.
💡 Suggested fix
+SUPPORTED_EXPORT_FORMATS = {"markdown", "html", "txt"}
@@
async def download_document_content(
self,
doc_url: str,
user_preferences: Optional[str] = None,
export_format: str = "markdown",
@@
+ if export_format not in SUPPORTED_EXPORT_FORMATS:
+ raise ValueError(
+ f"Unsupported export_format: {export_format}. "
+ f"Supported formats: {', '.join(sorted(SUPPORTED_EXPORT_FORMATS))}"
+ )
+
doc_id = self._extract_doc_id_from_url(doc_url)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/services/dingtalk/docs_service.py` around lines 343 - 405, The
export_format parameter is not validated before calling the MCP and is later
used directly as file_extension; update the method handling doc download to
validate export_format against allowed values (e.g., "markdown", "html", "txt")
before calling _call_dingtalk_mcp_tool, map "markdown" to "md" for
file_extension, and raise a clear ValueError for unsupported formats; locate
this logic in the same function (the one that calls _call_dingtalk_mcp_tool and
uses _format_modified_time) and perform validation/mapping right after
extracting doc_id and before building the MCP arguments.
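A runnable sketch of the validation/mapping step the comment asks for; `resolve_extension` is a hypothetical helper, not the service's actual method:

```python
SUPPORTED_EXPORT_FORMATS = {"markdown", "html", "txt"}
_EXTENSION_MAP = {"markdown": "md"}  # markdown files use the ".md" extension


def resolve_extension(export_format: str) -> str:
    """Validate export_format up front, then map it to a file extension."""
    if export_format not in SUPPORTED_EXPORT_FORMATS:
        raise ValueError(
            f"Unsupported export_format: {export_format}. "
            f"Supported formats: {', '.join(sorted(SUPPORTED_EXPORT_FORMATS))}"
        )
    return _EXTENSION_MAP.get(export_format, export_format)
```

Validating before the MCP call fails fast on bad input instead of forwarding an unsupported format downstream and then reusing it as a file extension.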
```python
                raise ValueError(
                    f"DingTalk authentication required. Please ensure:\n"
                    f"1. You have access to this document in DingTalk\n"
                    f"2. The document is shared with you or is public\n"
                    f"3. Your DingTalk MCP configuration is correct"
                )
            raise ValueError(
                f"Failed to download document from DingTalk: HTTP {e.response.status_code}"
            )
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
uv run ruff check backend/app/services/dingtalk/docs_service.py --select F541,B904 -n

Repository: wecode-ai/Wegent
Length of output: 102
🏁 Script executed:
#!/bin/bash
# First, verify the file exists and check line count
wc -l backend/app/services/dingtalk/docs_service.py
# Read lines 410-425 to get full context
sed -n '410,425p' backend/app/services/dingtalk/docs_service.py

Repository: wecode-ai/Wegent
Length of output: 932
🏁 Script executed:
#!/bin/bash
# Read lines 400-430 to see the full try-except context
sed -n '400,430p' backend/app/services/dingtalk/docs_service.py

Repository: wecode-ai/Wegent
Length of output: 1561
Chain exceptions and replace unnecessary f-strings with regular strings.
Lines 415-418 use f-strings without any variable placeholders and should be regular strings. Additionally, both ValueError raises at lines 414 and 420 should chain the httpx.HTTPStatusError exception with from e to preserve the original traceback context:
raise ValueError(...) from e
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 414-419: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
[error] 415-415: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 416-416: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 417-417: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 418-418: f-string without any placeholders
Remove extraneous f prefix
(F541)
[warning] 420-422: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/services/dingtalk/docs_service.py` around lines 414 - 422, The
two ValueError raises that handle httpx.HTTPStatusError should chain the
original exception (use "raise ValueError(...) from e") and the multi-line
messages that use f-strings without interpolation should be plain string
literals; update the raises referencing the caught exception variable "e" so
both raise ValueError(...) from e and replace f"... " with regular "..." for the
static messages (the ValueError lines that mention DingTalk authentication and
the one that includes HTTP {e.response.status_code}).
- Fix get_document_info to use correct nodeId parameter instead of doc_id/url
- Rewrite download_document_content to follow DingTalk MCP workflow:
1. Call get_document_info to get contentType and extension
2. Route to appropriate tool based on document type:
- adoc: use get_document_content
- axls: raise error (requires dingtalk_table MCP)
- able: raise error (requires dingtalk_ai_table MCP)
- file: use download_file and fetch from downloadUrl
- Add proper error handling for unsupported document types
- Add urlparse validation to _extract_doc_id_from_url for proper hostname checking
- Extract magic numbers to constants (HTTP_TIMEOUT_SECONDS, AUTH_ERROR_CODES, DOC_ID_PREVIEW_LENGTH)
- Add return type annotation to __init__ method
- Remove placeholder fallback in get_document_info, raise proper exceptions instead
- Apply consistent black formatting
Addresses CodeRabbit review feedback for improved security and maintainability.
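A hedged sketch of what urlparse-based hostname validation for URL parsing might look like; the `ALLOWED_HOSTS` set, the path shape, and the function name are assumptions for illustration, not the actual implementation:

```python
import re
from urllib.parse import urlparse

# Assumed allow-list; the real service may accept more DingTalk hosts.
ALLOWED_HOSTS = {"alidocs.dingtalk.com"}

def extract_doc_id(url: str) -> str:
    """Validate scheme and hostname with urlparse before regex-matching the doc ID."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported URL scheme: {parsed.scheme!r}")
    if parsed.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"Unexpected host: {parsed.hostname!r}")
    # Hypothetical path shape: /i/nodes/<doc_id>
    match = re.search(r"/nodes/([A-Za-z0-9]+)", parsed.path)
    if not match:
        raise ValueError("Could not extract document ID from URL")
    return match.group(1)
```

Checking `parsed.hostname` (rather than substring-matching the raw URL) avoids false positives like `evil.example.com/alidocs.dingtalk.com`.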
- Add Accept: application/json header to MCP requests to fix HTTP 406 errors
- Add response body logging for better debugging
- Add specific error handling for 406 Not Acceptable with detailed message
The 406 error typically indicates content negotiation issues between the client and the DingTalk MCP gateway server.
- Fix get_document_info to use camelCase field names (modifiedTime, contentType) as returned by DingTalk MCP instead of snake_case
- Add debug logging to show available fields in MCP response
- Add fallback field names for get_document_content (markdown, text) in case the content field is empty
The DingTalk MCP returns camelCase field names, but the code was expecting snake_case names, causing metadata extraction to fail.
DingTalk MCP get_document_info returns different field names than expected:
- 'name' instead of 'title' for document name
- 'updateTime' instead of 'modifiedTime' for modification time
- 'docUrl' for the document URL
Updated the field mapping to match the actual DingTalk MCP response format.
DingTalk MCP returns updateTime as a Unix timestamp in milliseconds (13 digits). Updated _format_modified_time to:
- Detect millisecond timestamps (13-digit numbers)
- Use the original timestamp directly in the filename (e.g., 产品需求文档_1775705225000)
- Fall back to YYYYMMDDHHMMSS format for ISO datetime strings
This matches the user's requested filename format.
- Fix doc_info.get to use 'name' instead of 'title' for document title
- Fix doc_info.get to use 'updateTime' instead of 'modifiedTime' for timestamp
- Update build_dingtalk_doc_filename docstring to match actual behavior
- Example now shows Unix timestamp format: 产品需求文档_1775705225000.md
- Args description updated to reflect timestamp format
This ensures correct filename generation when downloading documents.
Update modified_time format validation to accept both:
- Unix timestamp in milliseconds (13 digits)
- YYYYMMDDHHMMSS format (14 digits)
Previously only the 14-digit format was accepted, causing warnings when DingTalk MCP returns 13-digit Unix timestamps.
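The dual-format validation and `{title}_{timestamp}.md` naming described in these commits can be sketched as follows; the function names are illustrative and the real service code may differ:

```python
import re
from datetime import datetime

def format_modified_time(value: str) -> str:
    """Accept a 13-digit millisecond Unix timestamp or a 14-digit
    YYYYMMDDHHMMSS string as-is; otherwise parse as ISO datetime."""
    if re.fullmatch(r"\d{13}", value):  # Unix timestamp in milliseconds
        return value
    if re.fullmatch(r"\d{14}", value):  # already YYYYMMDDHHMMSS
        return value
    # Fallback: ISO datetime string -> YYYYMMDDHHMMSS
    return datetime.fromisoformat(value).strftime("%Y%m%d%H%M%S")

def build_filename(title: str, modified_time: str) -> str:
    """Build the {title}_{timestamp}.md filename used for imported docs."""
    return f"{title}_{format_modified_time(modified_time)}.md"
```

For example, `build_filename("产品需求文档", "1775705225000")` yields the `产品需求文档_1775705225000.md` name quoted in the commit messages.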
@coderabbitai resume
✅ Actions performed
Reviews resumed.
Actionable comments posted: 1
♻️ Duplicate comments (2)
backend/app/services/dingtalk/docs_service.py (1)
507-529: ⚠️ Potential issue | 🟡 Minor
Remove unnecessary f-string prefixes and chain exceptions.
Lines 513-516 use f-strings without any placeholders. Additionally, the `ValueError` raises should chain the original exception with `from e` to preserve traceback context.
🔧 Proposed fix
  except httpx.HTTPStatusError as e:
      logger.error(
          f"HTTP error from DingTalk MCP: {e.response.status_code} - {e.response.text}"
      )
      if e.response.status_code in AUTH_ERROR_CODES:
          raise ValueError(
-             f"DingTalk authentication required. Please ensure:\n"
-             f"1. You have access to this document in DingTalk\n"
-             f"2. The document is shared with you or is public\n"
-             f"3. Your DingTalk MCP configuration is correct"
-         )
+             "DingTalk authentication required. Please ensure:\n"
+             "1. You have access to this document in DingTalk\n"
+             "2. The document is shared with you or is public\n"
+             "3. Your DingTalk MCP configuration is correct"
+         ) from e
      raise ValueError(
          f"Failed to download document from DingTalk: HTTP {e.response.status_code}"
-     )
+     ) from e
  except Exception as e:
      logger.error(f"Failed to download document from MCP: {e}")
      error_msg = str(e)
      if "authentication" in error_msg.lower() or "auth" in error_msg.lower():
          raise ValueError(
              f"DingTalk authentication required: {error_msg}\n"
-             f"Please ensure you have access to this document in DingTalk."
-         )
-     raise ValueError(f"Failed to download document: {e}")
+             "Please ensure you have access to this document in DingTalk."
+         ) from e
+     raise ValueError(f"Failed to download document: {e}") from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/dingtalk/docs_service.py` around lines 507 - 529, In the DingTalk docs service exception handlers (the except httpx.HTTPStatusError and generic except Exception blocks in docs_service.py), remove the unnecessary f-string prefixes from string literals (e.g., logger.error and ValueError messages that have no placeholders) and change each bare "raise ValueError(...)" to chain the original exception using "raise ValueError(... ) from e" so the traceback/context is preserved; reference the logger, AUTH_ERROR_CODES, and the caught exception variable e when making these edits.
backend/app/mcp_server/tools/dingtalk_docs.py (1)
195-217: ⚠️ Potential issue | 🟠 Major
The timestamped filename is not being used in document creation.
The filename is built at line 196 following the `{title}_{timestamp}.md` convention, but the document is created with `name=title` (line 212) instead of `name=filename`. The filename only appears in the response payload (line 223) but the actual document doesn't use it.
🔧 Proposed fix
  result = await asyncio.to_thread(
      knowledge_orchestrator.create_document_with_content,
      db=db,
      user=user,
      knowledge_base_id=knowledge_base_id,
-     name=title,
+     name=filename,
      source_type="text",
      content=doc_content,
+     file_extension=".md",
      trigger_indexing=trigger_indexing,
      trigger_summary=trigger_summary,
  )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 195 - 217, The created document currently uses name=title instead of the timestamped filename produced by dingtalk_docs_service.build_filename; update the call to knowledge_orchestrator.create_document_with_content to pass the generated filename (variable filename) as the name argument (instead of title) so the stored document uses the `{title}_{timestamp}.md` name returned in the response; ensure you reference the existing variables filename and title and the function knowledge_orchestrator.create_document_with_content when making this change.
🧹 Nitpick comments (1)
backend/app/services/dingtalk/docs_service.py (1)
364-369: The `export_format` parameter is unused.
The `export_format` parameter is documented and accepted but never used in the implementation. The actual download format is determined by the document's `contentType` and `extension` from the MCP response. Consider either removing the unused parameter or implementing format-based export logic if needed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/dingtalk/docs_service.py` around lines 364 - 369, The download_document_content function currently accepts an export_format parameter but never uses it; either remove the unused parameter or implement logic to respect it: update download_document_content to inspect export_format and map it to the appropriate request/conversion path (e.g., force markdown/pdf/html export when export_format == "markdown"/"pdf"/"html"), or if conversion isn't supported, remove export_format from the signature and all callers; reference the download_document_content function and its export_format parameter when making the change and ensure any tests/call sites are updated accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 248-262: The except block catching httpx.HTTPStatusError (the
handler using "except httpx.HTTPStatusError as e") currently raises several
ValueError instances without chaining; update each raise (the ones for
AUTH_ERROR_CODES, status 406, and the generic failure in the DingTalk MCP call)
to include exception chaining by appending "from e" so the original
httpx.HTTPStatusError traceback is preserved for debugging while keeping the
existing log calls (logger.error) intact.
---
Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 195-217: The created document currently uses name=title instead of
the timestamped filename produced by dingtalk_docs_service.build_filename;
update the call to knowledge_orchestrator.create_document_with_content to pass
the generated filename (variable filename) as the name argument (instead of
title) so the stored document uses the `{title}_{timestamp}.md` name returned in
the response; ensure you reference the existing variables filename and title and
the function knowledge_orchestrator.create_document_with_content when making
this change.
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 507-529: In the DingTalk docs service exception handlers (the
except httpx.HTTPStatusError and generic except Exception blocks in
docs_service.py), remove the unnecessary f-string prefixes from string literals
(e.g., logger.error and ValueError messages that have no placeholders) and
change each bare "raise ValueError(...)" to chain the original exception using
"raise ValueError(... ) from e" so the traceback/context is preserved; reference
the logger, AUTH_ERROR_CODES, and the caught exception variable e when making
these edits.
---
Nitpick comments:
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 364-369: The download_document_content function currently accepts
an export_format parameter but never uses it; either remove the unused parameter
or implement logic to respect it: update download_document_content to inspect
export_format and map it to the appropriate request/conversion path (e.g., force
markdown/pdf/html export when export_format == "markdown"/"pdf"/"html"), or if
conversion isn't supported, remove export_format from the signature and all
callers; reference the download_document_content function and its export_format
parameter when making the change and ensure any tests/call sites are updated
accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4f41b824-1638-4c3d-8079-5c55d802c848
📒 Files selected for processing (2)
backend/app/mcp_server/tools/dingtalk_docs.py
backend/app/services/dingtalk/docs_service.py
except httpx.HTTPStatusError as e:
    logger.error(f"HTTP error from DingTalk MCP: {e.response.status_code}")
    logger.error(f"Response body: {e.response.text}")
    if e.response.status_code in AUTH_ERROR_CODES:
        raise ValueError(
            "DingTalk authentication failed. Please check your MCP configuration."
        )
    if e.response.status_code == 406:
        raise ValueError(
            f"DingTalk MCP request format not acceptable (HTTP 406). "
            f"Response: {e.response.text}"
        )
    raise ValueError(
        f"Failed to call DingTalk MCP: HTTP {e.response.status_code}"
    )
Chain exceptions to preserve traceback context.
The ValueError raises in the exception handler should chain the original exception using from e to preserve the traceback for debugging.
🔧 Proposed fix
  except httpx.HTTPStatusError as e:
      logger.error(f"HTTP error from DingTalk MCP: {e.response.status_code}")
      logger.error(f"Response body: {e.response.text}")
      if e.response.status_code in AUTH_ERROR_CODES:
          raise ValueError(
              "DingTalk authentication failed. Please check your MCP configuration."
-         )
+         ) from e
      if e.response.status_code == 406:
          raise ValueError(
              f"DingTalk MCP request format not acceptable (HTTP 406). "
              f"Response: {e.response.text}"
-         )
+         ) from e
      raise ValueError(
          f"Failed to call DingTalk MCP: HTTP {e.response.status_code}"
-     )
+     ) from e
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 252-254: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
[warning] 256-259: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
[warning] 260-262: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/services/dingtalk/docs_service.py` around lines 248 - 262, The
except block catching httpx.HTTPStatusError (the handler using "except
httpx.HTTPStatusError as e") currently raises several ValueError instances
without chaining; update each raise (the ones for AUTH_ERROR_CODES, status 406,
and the generic failure in the DingTalk MCP call) to include exception chaining
by appending "from e" so the original httpx.HTTPStatusError traceback is
preserved for debugging while keeping the existing log calls (logger.error)
intact.
…ctor
Rename the skill to avoid conflicts with other definitions:
- Folder: dingtalk-docs → dingtalk-connector
- Skill name: dingtalk-docs → dingtalk-connector
- Provider name: dingtalk-docs → dingtalk-connector
- Task type: dingtalk-docs → dingtalk-connector
Updated all references in:
- SKILL.md
- provider.py
- mcp_provider_registry.py
- dingtalk_docs.py
- Test files
Actionable comments posted: 4
♻️ Duplicate comments (1)
backend/app/mcp_server/tools/dingtalk_docs.py (1)
195-217: ⚠️ Potential issue | 🟠 Major
The generated `{title}_{timestamp}.md` name is still never applied.
`filename` is built and returned, but `create_document_with_content()` still receives `name=title`. The text-import path therefore does not actually persist the naming convention this tool advertises.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 195 - 217, The code builds a filename via dingtalk_docs_service.build_filename (assigned to filename) but still passes the original title into knowledge_orchestrator.create_document_with_content (name=title), so the generated `{title}_{timestamp}.md` never gets persisted; update the call to create_document_with_content to pass name=filename (and any other filename-related parameter the function supports) instead of title so the stored document uses the built filename from filename.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 33-41: Add OpenTelemetry tracing decorators to the MCP handlers so
their import path appears in traces: decorate the async handler
get_dingtalk_document_info with `@trace_async` using a descriptive span name
(e.g., "mcp.get_dingtalk_document_info"), the appropriate tracer name (e.g.,
"mcp") and the existing extract_attributes callable; likewise apply `@trace_async`
or `@trace_sync` (matching each function's async/sync nature) to the other
handlers defined in the file around the blocks at 90-104 and 237-249, and ensure
the trace decorator is imported where these functions are defined.
In `@backend/init_data/skills/dingtalk-connector/provider.py`:
- Around line 603-618: The download step in _download_document_content_real()
fails to extract IDs for valid DingTalk wiki links even though
_get_document_info_from_mcp() accepts them; update the URL extraction logic used
in _download_document_content_real() (the local variable doc_url and the
patterns list) to include the wiki form accepted upstream (e.g., add a regex for
/wiki/... that captures the document ID) or normalize wiki URLs into the
nodes/docs form before running the existing patterns so that the same doc_id is
produced for wiki links and the "Could not extract document ID" error no longer
occurs.
- Around line 224-232: Add OpenTelemetry tracing to the async entrypoint by
decorating the _arun method with `@trace_async`; e.g., import trace_async and add
a decorator like `@trace_async`("dingtalk.import", "dingtalk-connector", lambda
ctx, args, kwargs: {"dingtalk_doc_url": kwargs.get("dingtalk_doc_url", args[0]
if args else None), "knowledge_base_id": kwargs.get("knowledge_base_id", args[1]
if len(args) > 1 else None)}). Apply this to the async def _arun(...) function
so the whole async flow emits a top-level span and include the import for
trace_async at the top of the module.
- Around line 635-645: The code builds an MCP payload named "download_document"
(payload variable) which is not a registered tool and will fail at runtime;
replace the "name": "download_document" entry with
"add_dingtalk_doc_to_knowledge" and adjust the arguments to match that tool
(pass doc_id/doc_url and any required metadata/format), or alternatively
register/implement a new MCP tool named "download_document" that performs the
standalone download; refer to the existing DingTalk tools
get_dingtalk_document_info, add_dingtalk_doc_to_knowledge, and
add_dingtalk_doc_with_attachment to align argument names and behavior.
---
Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 195-217: The code builds a filename via
dingtalk_docs_service.build_filename (assigned to filename) but still passes the
original title into knowledge_orchestrator.create_document_with_content
(name=title), so the generated `{title}_{timestamp}.md` never gets persisted;
update the call to create_document_with_content to pass name=filename (and any
other filename-related parameter the function supports) instead of title so the
stored document uses the built filename from filename.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: da173f5d-960b-4536-ad5d-3436cb712069
📒 Files selected for processing (7)
backend/app/mcp_server/tools/dingtalk_docs.py
backend/app/services/mcp_provider_registry.py
backend/init_data/skills/dingtalk-connector/SKILL.md
backend/init_data/skills/dingtalk-connector/__init__.py
backend/init_data/skills/dingtalk-connector/provider.py
backend/tests/services/execution/test_request_builder_mcp.py
backend/tests/services/execution/test_request_builder_user_mcp.py
✅ Files skipped from review due to trivial changes (2)
- backend/init_data/skills/dingtalk-connector/__init__.py
- backend/init_data/skills/dingtalk-connector/SKILL.md
# Build MCP tool call payload for download_document
payload = {
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {
        "name": "download_document",
        "arguments": {
            "doc_id": doc_id,
            "url": doc_url,
            "format": "markdown",
        },
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
echo "Searching for MCP tools named download_document:"
rg -n --type=py 'name\s*=\s*"download_document"' backend/app/mcp_server/tools -C2 || true
echo
echo "Searching for knowledge-server DingTalk tools introduced in this PR:"
rg -n --type=py 'name\s*=\s*"get_dingtalk_document_info"|name\s*=\s*"add_dingtalk_doc_to_knowledge"|name\s*=\s*"add_dingtalk_doc_with_attachment"|build_mcp_tools_dict\(server="knowledge"\)' backend/app/mcp_server/tools backend/app/mcp_server/server.py -C1
Repository: wecode-ai/Wegent
Length of output: 1654
🏁 Script executed:
#!/bin/bash
echo "=== Search for download_document across entire codebase ==="
rg -n 'download_document' --type=py . 2>/dev/null | head -20
echo
echo "=== Examine context in provider.py around lines 635-645 ==="
cat -n backend/init_data/skills/dingtalk-connector/provider.py | sed -n '625,650p'
Repository: wecode-ai/Wegent
Length of output: 2095
🏁 Script executed:
#!/bin/bash
echo "=== Check for dynamic tool registration or aliases ==="
rg -n 'download_document|dynamic.*tool|register.*tool' backend/app/mcp_server/ --type=py | grep -i 'dynamic\|alias\|register' | head -10
echo
echo "=== Check add_dingtalk_doc_to_knowledge implementation ==="
sed -n '90,120p' backend/app/mcp_server/tools/dingtalk_docs.pyRepository: wecode-ai/Wegent
Length of output: 2668
download_document MCP tool does not exist and will cause runtime failure.
The code posts name: "download_document" to /mcp/knowledge/sse, but this tool is not registered. The only DingTalk-related knowledge tools available are get_dingtalk_document_info, add_dingtalk_doc_to_knowledge, and add_dingtalk_doc_with_attachment. The workflow will fail when this code path executes.
Consider using add_dingtalk_doc_to_knowledge (which handles document downloads) or create the missing download_document MCP tool if a standalone download operation is needed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 635 -
645, The code builds an MCP payload named "download_document" (payload variable)
which is not a registered tool and will fail at runtime; replace the "name":
"download_document" entry with "add_dingtalk_doc_to_knowledge" and adjust the
arguments to match that tool (pass doc_id/doc_url and any required
metadata/format), or alternatively register/implement a new MCP tool named
"download_document" that performs the standalone download; refer to the existing
DingTalk tools get_dingtalk_document_info, add_dingtalk_doc_to_knowledge, and
add_dingtalk_doc_with_attachment to align argument names and behavior.
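A corrected payload using the registered tool name might look like the sketch below; the argument names (`doc_url`, `knowledge_base_id`) are assumptions based on the tools listed in this review, not a confirmed schema:

```python
# Hypothetical sketch: call the registered knowledge-server tool
# "add_dingtalk_doc_to_knowledge" instead of the non-existent
# "download_document". Argument names are assumptions for illustration.
def build_mcp_payload(doc_url: str, knowledge_base_id: int) -> dict:
    return {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "add_dingtalk_doc_to_knowledge",
            "arguments": {
                "doc_url": doc_url,
                "knowledge_base_id": knowledge_base_id,
            },
        },
    }
```

Since that tool handles the download internally, the skill would no longer need a standalone download step at all.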
…uments
1. Add source_config parameter to create_document_with_content:
- Accepts optional source_config dict
- Passes to KnowledgeDocumentCreate
2. Update add_dingtalk_doc_to_knowledge to:
- Extract file_extension from DingTalk response (default: md)
- Build source_config with url, source, and updated_at
- Pass file_extension and source_config to document creation
source_config format:
{
"url": "https://alidocs.dingtalk.com/i/nodes/...",
"source": "dingtalk-connector",
"updated_at": 1775705225000
}
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/services/knowledge/orchestrator.py (1)
1093-1109: ⚠️ Potential issue | 🟠 Major
Add tracing decorator to the modified sync orchestrator entrypoint.
`create_document_with_content(...)` was modified but still has no `@trace_sync(...)`, so this new DingTalk ingestion path remains untraced in OTel.
As per coding guidelines, "Use `@trace_sync(span_name, tracer_name, extract_attributes)` decorator to trace entire sync functions in OpenTelemetry".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/knowledge/orchestrator.py` around lines 1093 - 1109, The create_document_with_content sync entrypoint is missing the OpenTelemetry tracing decorator; add `@trace_sync`(...) above the create_document_with_content method, importing trace_sync if not already present, and configure span_name (e.g., "knowledge.create_document_with_content"), tracer_name (matching existing service tracer), and an extract_attributes callable that captures key attributes such as knowledge_base_id, user.id, source_type, and trigger_indexing so the entire ingestion path is traced; ensure the decorator is applied to the method signature in orchestrator.py (create_document_with_content).
♻️ Duplicate comments (2)
backend/app/mcp_server/tools/dingtalk_docs.py (2)
203-204: ⚠️ Potential issue | 🟠 Major
Timestamped filename is computed but not used in persistence path.
Line 204 builds `{title}_{timestamp}.md`, but Line 227 still passes `name=title`, so the actual stored/uploaded name does not follow the advertised convention.
Suggested fix
  # Build filename according to naming convention
  filename = dingtalk_docs_service.build_filename(title, modified_time)
+ name_without_ext, ext = filename.rsplit(".", 1)
@@
  result = await asyncio.to_thread(
      knowledge_orchestrator.create_document_with_content,
      db=db,
      user=user,
      knowledge_base_id=knowledge_base_id,
-     name=title,
+     name=name_without_ext,
      source_type="text",
      content=doc_content,
-     file_extension=file_extension,
+     file_extension=ext,
      trigger_indexing=trigger_indexing,
      trigger_summary=trigger_summary,
      source_config=source_config,
  )
Also applies to: 222-234
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 203 - 204, The timestamped filename created by dingtalk_docs_service.build_filename(title, modified_time) is computed into the variable filename but never used; update the persistence/upload call that currently passes name=title to instead pass name=filename (i.e., use the filename variable), ensuring the stored/uploaded document follows the {title}_{timestamp}.md convention; look for usages around the call that sets filename and the subsequent persist/upload function where name=title and replace that argument.
33-41: ⚠️ Potential issue | 🟠 Major
Trace decorators are still missing on all newly added MCP handlers.
`get_dingtalk_document_info`, `add_dingtalk_doc_to_knowledge`, and `add_dingtalk_doc_with_attachment` should be wrapped with `@trace_async`/`@trace_sync` to make this flow observable.
As per coding guidelines, "Use `@trace_async(span_name, tracer_name, extract_attributes)` decorator to trace entire async functions in OpenTelemetry" and "Use `@trace_sync(span_name, tracer_name, extract_attributes)` decorator to trace entire sync functions in OpenTelemetry".
Also applies to: 90-104, 254-266
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 33 - 41, The three newly added MCP handler functions get_dingtalk_document_info, add_dingtalk_doc_to_knowledge, and add_dingtalk_doc_with_attachment need OpenTelemetry decorators; wrap each async handler with `@trace_async`(span_name, tracer_name, extract_attributes) and any sync handler with `@trace_sync`(...) so their execution is observable: add `@trace_async` for get_dingtalk_document_info and for any other async functions (e.g., add_dingtalk_doc_with_attachment if async), and use `@trace_sync` for synchronous handlers (e.g., add_dingtalk_doc_to_knowledge if sync), providing a clear span_name (e.g., "mcp.get_dingtalk_document_info"), the appropriate tracer_name, and an extract_attributes lambda that pulls relevant params (like doc_url, node_id) to ensure consistent tracing across these handlers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 317-323: The response currently returns the input attachment_id
which can be stale because the orchestrator
(backend/app/services/knowledge/orchestrator.py) copies the attachment and links
the document to a new attachment; update the return payload in the function that
constructs the response (the block using result, doc_title and attachment_id in
backend/app/mcp_server/tools/dingtalk_docs.py) to return the actual linked
attachment id from the created document (use result.attachment_id) instead of
the original input attachment_id; also ensure result has an attachment_id
attribute before returning (or fall back appropriately).
- Around line 222-234: You’re passing the SQLAlchemy Session object `db` and ORM
`user` into `asyncio.to_thread` which is unsafe; instead either (A) open/close a
new DB session inside the worker thread and pass only primitives, or (B) avoid
moving DB/ORM work into a thread by calling `await
knowledge_orchestrator.create_document_with_content(...)` directly. Concretely:
change the call site that uses
`asyncio.to_thread(knowledge_orchestrator.create_document_with_content, db=db,
user=user, ...)` to pass only primitive IDs/strings (e.g. `user_id=user.id`) and
have `knowledge_orchestrator.create_document_with_content` accept `user_id` and
create its own session inside the thread, or refactor
`create_document_with_content` to be async and call it without `to_thread`; do
not pass `db` or ORM `user` across the thread boundary.
In `@backend/app/services/knowledge/orchestrator.py`:
- Line 1108: The attachment import branch in orchestrator.py drops the
source_config parameter when constructing the KnowledgeDocumentCreate for
source_type="attachment", so metadata is lost; update the attachment flow where
KnowledgeDocumentCreate is built (the block that creates KnowledgeDocumentCreate
in the attachment branch) to accept and pass through the source_config
Optional[Dict[str, Any]] exactly like the non-attachment path (the later
persistence path that already includes source_config), ensuring the constructor
call includes source_config and any downstream calls persist it unchanged.
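The thread-safety advice above (pass only primitives across `asyncio.to_thread` and open the database handle inside the worker) can be illustrated with a stdlib `sqlite3` stand-in; the schema and function names here are hypothetical, not the project's SQLAlchemy code:

```python
import asyncio
import sqlite3

DB_PATH = ":memory:"  # stand-in for the real database URL

def create_document(user_id: int, name: str) -> int:
    # The connection is opened inside the worker thread; only the
    # primitives user_id and name crossed the thread boundary.
    conn = sqlite3.connect(DB_PATH)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS documents "
            "(id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT)"
        )
        cur = conn.execute(
            "INSERT INTO documents (user_id, name) VALUES (?, ?)",
            (user_id, name),
        )
        conn.commit()
        return cur.lastrowid
    finally:
        conn.close()

async def handler(user_id: int) -> int:
    # Safe: no Session or ORM instance is shared with the worker thread.
    return await asyncio.to_thread(create_document, user_id, "报告_1775705225000.md")
```

The same shape applies to SQLAlchemy: accept `user_id` rather than an ORM `User`, and create a fresh `Session` inside the threaded function.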
---
Outside diff comments:
In `@backend/app/services/knowledge/orchestrator.py`:
- Around line 1093-1109: The create_document_with_content sync entrypoint is
missing the OpenTelemetry tracing decorator; add `@trace_sync`(...) above the
create_document_with_content method, importing trace_sync if not already
present, and configure span_name (e.g.,
"knowledge.create_document_with_content"), tracer_name (matching existing
service tracer), and an extract_attributes callable that captures key attributes
such as knowledge_base_id, user.id, source_type, and trigger_indexing so the
entire ingestion path is traced; ensure the decorator is applied to the method
signature in orchestrator.py (create_document_with_content).
---
Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 203-204: The timestamped filename created by
dingtalk_docs_service.build_filename(title, modified_time) is computed into the
variable filename but never used; update the persistence/upload call that
currently passes name=title to instead pass name=filename (i.e., use the
filename variable), ensuring the stored/uploaded document follows the
{title}_{timestamp}.md convention; look for usages around the call that sets
filename and the subsequent persist/upload function where name=title and replace
that argument.
- Around line 33-41: The three newly added MCP handler functions
get_dingtalk_document_info, add_dingtalk_doc_to_knowledge, and
add_dingtalk_doc_with_attachment need OpenTelemetry decorators; wrap each async
handler with `@trace_async`(span_name, tracer_name, extract_attributes) and any
sync handler with `@trace_sync`(...) so their execution is observable: add
`@trace_async` for get_dingtalk_document_info and for any other async functions
(e.g., add_dingtalk_doc_with_attachment if async), and use `@trace_sync` for
synchronous handlers (e.g., add_dingtalk_doc_to_knowledge if sync), providing a
clear span_name (e.g., "mcp.get_dingtalk_document_info"), the appropriate
tracer_name, and an extract_attributes lambda that pulls relevant params (like
doc_url, node_id) to ensure consistent tracing across these handlers.
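The `@trace_async(span_name, tracer_name, extract_attributes)` pattern these comments call for can be mimicked with a simplified stand-in decorator; this is not the project's actual implementation, which would emit real OpenTelemetry spans rather than append to a list:

```python
import asyncio
import functools

RECORDED_SPANS = []  # stand-in for an OpenTelemetry span exporter

def trace_async(span_name, tracer_name, extract_attributes):
    """Simplified stand-in for the project's @trace_async decorator:
    records a span with attributes pulled from the call arguments."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            attrs = extract_attributes(None, args, kwargs)
            RECORDED_SPANS.append(
                {"name": span_name, "tracer": tracer_name, "attributes": attrs}
            )
            return await func(*args, **kwargs)
        return wrapper
    return decorator

@trace_async(
    "mcp.get_dingtalk_document_info",
    "mcp",
    lambda ctx, args, kwargs: {"doc_url": kwargs.get("doc_url")},
)
async def get_dingtalk_document_info(*, doc_url: str) -> dict:
    # Toy handler body; the real tool would query the DingTalk MCP.
    return {"url": doc_url}
```

Wrapping each handler this way gives every MCP call a named span carrying the `doc_url`/`node_id` attributes the review asks for.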
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5f9854a9-9fbf-4ad0-916b-52645591f987
📒 Files selected for processing (2)
backend/app/mcp_server/tools/dingtalk_docs.py
backend/app/services/knowledge/orchestrator.py
```python
return {
    "success": True,
    "document_id": result.id,
    "document_name": result.name,
    "attachment_id": attachment_id,
    "message": f"Document '{doc_title}' added to knowledge base successfully",
}
```
Returned attachment_id may be inaccurate for the created document.
This response returns the input attachment_id, but the attachment source path creates a copied attachment in orchestrator (backend/app/services/knowledge/orchestrator.py, Line 1216-1231). Return result.attachment_id to reflect the actual linked attachment.
Suggested fix

```diff
 return {
     "success": True,
     "document_id": result.id,
     "document_name": result.name,
-    "attachment_id": attachment_id,
+    "attachment_id": result.attachment_id,
     "message": f"Document '{doc_title}' added to knowledge base successfully",
 }
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
return {
    "success": True,
    "document_id": result.id,
    "document_name": result.name,
    "attachment_id": result.attachment_id,
    "message": f"Document '{doc_title}' added to knowledge base successfully",
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 317 - 323, The
response currently returns the input attachment_id which can be stale because
the orchestrator (backend/app/services/knowledge/orchestrator.py) copies the
attachment and links the document to a new attachment; update the return payload
in the function that constructs the response (the block using result, doc_title
and attachment_id in backend/app/mcp_server/tools/dingtalk_docs.py) to return
the actual linked attachment id from the created document (use
result.attachment_id) instead of the original input attachment_id; also ensure
result has an attachment_id attribute before returning (or fall back
appropriately).
Change filename format from {title}_{timestamp}.md to {title}.{file_extension}
to better align with DingTalk document info response structure.
The file_extension is now obtained directly from get_document_info response,
providing a cleaner and more intuitive filename format.
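The actual `_build_filename` helper in orchestrator.py is not reproduced in this thread; as an illustration only, a builder for the `{title}.{file_extension}` convention this commit describes might look like:

```python
def build_filename(title: str, file_extension: str) -> str:
    """Illustrative sketch of the {title}.{file_extension} convention.

    Not the PR's actual helper. It only assumes the extension may arrive
    with or without a leading dot and defaults to "md" when missing.
    """
    ext = (file_extension or "md").lstrip(".")
    return f"{title}.{ext}"


print(build_filename("Quarterly Report", "md"))
print(build_filename("Design Notes", ".docx"))
```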
Actionable comments posted: 1
♻️ Duplicate comments (3)
backend/app/mcp_server/tools/dingtalk_docs.py (3)
140-142: ⚠️ Potential issue | 🔴 Critical

Do not pass `db`/`user` across `asyncio.to_thread`.

`db` (SQLAlchemy session) and `user` (ORM object) are created on the main thread and then used in a worker thread. This is thread-unsafe and can cause runtime/data consistency failures.

Proposed fix (keep blocking work off-loop, but create thread-local DB/ORM objects):

```diff
-        result = await asyncio.to_thread(
-            knowledge_orchestrator.create_document_with_content,
-            db=db,
-            user=user,
-            knowledge_base_id=knowledge_base_id,
-            name=title,
-            source_type="text",
-            content=doc_content,
-            file_extension=file_extension,
-            trigger_indexing=trigger_indexing,
-            trigger_summary=trigger_summary,
-            source_config=source_config,
-        )
+        def _create_document_in_thread():
+            thread_db = SessionLocal()
+            try:
+                thread_user = _get_user_from_token(thread_db, token_info)
+                if not thread_user:
+                    raise ValueError("User not found")
+                return knowledge_orchestrator.create_document_with_content(
+                    db=thread_db,
+                    user=thread_user,
+                    knowledge_base_id=knowledge_base_id,
+                    name=title,
+                    source_type="text",
+                    content=doc_content,
+                    file_extension=file_extension,
+                    trigger_indexing=trigger_indexing,
+                    trigger_summary=trigger_summary,
+                    source_config=source_config,
+                )
+            finally:
+                thread_db.close()
+
+        result = await asyncio.to_thread(_create_document_in_thread)
```

Verification script:

```bash
#!/bin/bash
set -euo pipefail
echo "Check to_thread callsite and passed args:"
rg -n -C4 'asyncio\.to_thread\(' backend/app/mcp_server/tools/dingtalk_docs.py
echo
echo "Check orchestrator signature and DB/ORM expectations:"
rg -n -C10 'def create_document_with_content\(' backend/app/services/knowledge/orchestrator.py
```

Also applies to: 207-219
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 140 - 142, You are passing a SessionLocal() instance and an ORM user object from the main thread into asyncio.to_thread, which is thread-unsafe; instead, change the to_thread callsite so it does not receive db or ORM objects—pass only primitives (e.g., user id or token_info) and move SessionLocal() creation and the ORM query (the equivalent of _get_user_from_token) into the worker thread (or open a new session inside create_document_with_content / the threaded helper). Locate the asyncio.to_thread usage in dingtalk_docs.py and ensure the worker function creates its own SessionLocal(), re-queries the User by id (or re-validates token), and closes the session; apply the same fix pattern to the similar callsite around the create_document_with_content orchestration mentioned in the comment.
302-307: ⚠️ Potential issue | 🟡 Minor

Return the created document's attachment id, not the input id.

The response currently echoes the input `attachment_id`; this can become inaccurate if the create flow remaps/copies attachments.

Proposed fix:

```diff
     return {
         "success": True,
         "document_id": result.id,
         "document_name": result.name,
-        "attachment_id": attachment_id,
+        "attachment_id": getattr(result, "attachment_id", attachment_id),
         "message": f"Document '{doc_title}' added to knowledge base successfully",
     }
```

Verification script:

```bash
#!/bin/bash
set -euo pipefail
echo "Inspect attachment handling in orchestrator:"
rg -n -C8 'attachment_id|source_type.*attachment|copy' backend/app/services/knowledge/orchestrator.py
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 302 - 307, The response is echoing the input attachment_id instead of the attachment id returned by the create flow; update the return dict in backend/app/mcp_server/tools/dingtalk_docs.py to use the created attachment id from the API result (e.g., use result.attachment_id or result.attachment.id or result.attachments[0].id as provided by the create call) instead of the input variable attachment_id, and add a safe fallback (None) if that field is missing so the returned "attachment_id" reflects the actual stored attachment; adjust tests or callers if they expect the old input value.
33-41: ⚠️ Potential issue | 🟠 Major

Add tracing decorators to all three MCP handlers.

These entrypoints are untraced, so the DingTalk import path is not visible in OpenTelemetry.

As per coding guidelines, "Use `@trace_async(span_name, tracer_name, extract_attributes)` decorator to trace entire async functions in OpenTelemetry" and "Use `@trace_sync(span_name, tracer_name, extract_attributes)` decorator to trace entire sync functions in OpenTelemetry".

Also applies to: 90-104, 239-251
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 33 - 41, Add OpenTelemetry tracing decorators to the three MCP handler entrypoints in this file: attach `@trace_async` to get_dingtalk_document_info and the two other handler functions defined around the ranges referenced (lines ~90-104 and ~239-251); for each async handler use `@trace_async` with span_name set to a descriptive value (e.g., "mcp.get_dingtalk_document_info" for get_dingtalk_document_info), tracer_name "mcp", and an extract_attributes callable that extracts the doc_url (or other relevant params) into attributes so the DingTalk import path and request context are visible in traces; if any handler is sync use `@trace_sync` with the same naming/attributes convention.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 11-12: The docs in backend/app/mcp_server/tools/dingtalk_docs.py
mention a stale naming convention `{title}_{timestamp}.md` but the code actually
builds filenames as `{title}.{file_extension}`; update the textual
descriptions/comments in all locations (including the occurrences referenced
around lines 11-12, 92-93, 120-122, 264-265) to reflect the current
implementation, i.e., change `{title}_{timestamp}.md` to
`{title}.{file_extension}` (or a wording like "saved as
{title}.{file_extension}") and ensure any examples or prose around functions
that construct filenames (e.g., the filename builder, save/upload helpers) are
consistent with the actual filename format.
---
Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 140-142: You are passing a SessionLocal() instance and an ORM user
object from the main thread into asyncio.to_thread, which is thread-unsafe;
instead, change the to_thread callsite so it does not receive db or ORM
objects—pass only primitives (e.g., user id or token_info) and move
SessionLocal() creation and the ORM query (the equivalent of
_get_user_from_token) into the worker thread (or open a new session inside
create_document_with_content / the threaded helper). Locate the
asyncio.to_thread usage in dingtalk_docs.py and ensure the worker function
creates its own SessionLocal(), re-queries the User by id (or re-validates
token), and closes the session; apply the same fix pattern to the similar
callsite around the create_document_with_content orchestration mentioned in the
comment.
- Around line 302-307: The response is echoing the input attachment_id instead
of the attachment id returned by the create flow; update the return dict in
backend/app/mcp_server/tools/dingtalk_docs.py to use the created attachment id
from the API result (e.g., use result.attachment_id or result.attachment.id or
result.attachments[0].id as provided by the create call) instead of the input
variable attachment_id, and add a safe fallback (None) if that field is missing
so the returned "attachment_id" reflects the actual stored attachment; adjust
tests or callers if they expect the old input value.
- Around line 33-41: Add OpenTelemetry tracing decorators to the three MCP
handler entrypoints in this file: attach `@trace_async` to
get_dingtalk_document_info and the two other handler functions defined around
the ranges referenced (lines ~90-104 and ~239-251); for each async handler use
`@trace_async` with span_name set to a descriptive value (e.g.,
"mcp.get_dingtalk_document_info" for get_dingtalk_document_info), tracer_name
"mcp", and an extract_attributes callable that extracts the doc_url (or other
relevant params) into attributes so the DingTalk import path and request context
are visible in traces; if any handler is sync use `@trace_sync` with the same
naming/attributes convention.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3a41dae1-1bac-41e6-9f88-456d717f0ae1
📒 Files selected for processing (1)
backend/app/mcp_server/tools/dingtalk_docs.py
…up code

- Remove unused download_document method from DingTalkDocsService
- Remove unused imports and constants from docs_service.py
- Remove unused BaseModel import from dingtalk __init__.py
- Clean up duplicate import in dingtalk_docs.py MCP tools
- Update SKILL.md description to reflect actual workflow
- Remove unused _download_document_content method from provider.py

All changes are code cleanup with no functional impact.
Actionable comments posted: 1
♻️ Duplicate comments (6)
backend/init_data/skills/dingtalk-connector/provider.py (2)
598-614: ⚠️ Potential issue | 🟠 Major

Handle `/wiki/` URLs in the download step too.

`_get_document_info_from_mcp()` already accepts DingTalk wiki links, but `_download_document_content_real()` only extracts IDs from `/nodes/` and `/docs/`. A valid wiki URL will therefore pass step 2 and fail immediately at step 3 with "Could not extract document ID".

Minimal fix:

```diff
     patterns = [
         r"alidocs\.dingtalk\.com/i/nodes/([a-zA-Z0-9_-]+)",
         r"alidocs\.dingtalk\.com/i/team/[^/]+/docs/([a-zA-Z0-9_-]+)",
+        r"alidocs\.dingtalk\.com/i/team/[^/]+/wiki/([a-zA-Z0-9_-]+)",
     ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 598 - 614, _download_document_content_real() fails to extract IDs for DingTalk wiki links even though _get_document_info_from_mcp() accepts them; update the ID-extraction logic in _download_document_content_real() to include the wiki URL pattern used elsewhere (e.g., add a regex covering "/wiki/" URLs similar to the patterns for "/i/nodes/" and "/i/team/.../docs/") so that doc_id is correctly set for wiki links and the function no longer returns the "Could not extract document ID" error.
631-644: ⚠️ Potential issue | 🔴 Critical

`download_document` is not a registered MCP tool.

The knowledge server registers the decorated tools from backend/app/mcp_server/tools/dingtalk_docs.py, and that file exposes `get_dingtalk_document_info`, `add_dingtalk_doc_to_knowledge`, and `add_dingtalk_doc_with_attachment` — not `download_document`. This request will fail as soon as the content-download step runs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 631 - 644, The payload is calling an unregistered MCP tool "download_document"; replace it with a registered tool (e.g., "add_dingtalk_doc_with_attachment" or "get_dingtalk_document_info") or register a new tool. Update the "name" field in the payload from "download_document" to one of the existing exported tools (suggest "add_dingtalk_doc_with_attachment" if you intend to upload the downloaded content into knowledge) and ensure the arguments match that tool's signature (use doc_id, url and any expected keys), or alternatively implement/register a new tool named "download_document" in backend/app/mcp_server/tools/dingtalk_docs.py if that behavior is required.backend/app/services/dingtalk/docs_service.py (2)
158-172: ⚠️ Potential issue | 🟡 Minor

Chain the original `HTTPStatusError` when re-raising.

Both `except httpx.HTTPStatusError as e` blocks discard the original traceback by raising a fresh `ValueError` without `from e`, which makes MCP/download failures much harder to debug.

Also applies to: 414-427
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/dingtalk/docs_service.py` around lines 158 - 172, The except blocks catching httpx.HTTPStatusError (e) in docs_service.py currently re-raise new ValueError instances and discard the original exception context; update each re-raise in the HTTPStatusError handlers (the block shown and the similar block around lines 414-427) to chain the original exception by using "raise ValueError(... ) from e" so the original HTTPStatusError traceback is preserved; ensure you apply this to all branches (AUTH_ERROR_CODES branch, 406 branch, and the generic failure branch) that construct and raise ValueError.
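The `from e` chaining recommended above preserves the original error as `__cause__`. A minimal, self-contained illustration, using a stand-in exception class so the snippet runs without httpx installed:

```python
class UpstreamHTTPError(Exception):
    """Stand-in for httpx.HTTPStatusError."""


def fetch_document_info():
    raise UpstreamHTTPError("HTTP 403 from DingTalk API")


def get_document_info():
    try:
        return fetch_document_info()
    except UpstreamHTTPError as e:
        # "from e" keeps the upstream error attached as __cause__,
        # so tracebacks show both the ValueError and the HTTP failure.
        raise ValueError("Failed to fetch DingTalk document info") from e


caught = None
try:
    get_document_info()
except ValueError as err:
    caught = err
```

Without `from e`, the same re-raise would still set `__context__` implicitly, but an explicit cause signals that the new exception directly wraps the upstream one.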
272-277: ⚠️ Potential issue | 🟡 Minor

Reject or honor unsupported `export_format` values.

`export_format` is part of the public API here, but the method never validates it or passes it downstream. A caller asking for `html` or `txt` currently gets whatever the MCP returns, while `file_extension` is still inferred from the DingTalk metadata instead of the requested format.

Also applies to: 339-412
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/dingtalk/docs_service.py` around lines 272 - 277, download_document_content currently accepts an export_format but never validates or enforces it; update it (and the comparable methods in the 339-412 range) to validate export_format against an allowed set (e.g., {"markdown","html","txt","pdf"} or the formats supported by the MCP), raise a clear exception for unsupported values, and ensure the validated format is propagated to downstream requests and used to derive file_extension (instead of relying on DingTalk metadata); reference the download_document_content function name and the downstream call(s) that send export_format/file_extension so you can locate where to inject validation, mapping, and the error raise.backend/app/mcp_server/tools/dingtalk_docs.py (2)
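One way to implement the validation this comment asks for; the allowed set and extension mapping below are assumptions for illustration, not the MCP's documented formats:

```python
# Assumed format -> extension mapping; the real MCP's supported set may differ.
SUPPORTED_EXPORT_FORMATS = {"markdown": "md", "html": "html", "txt": "txt"}


def resolve_export_format(export_format: str) -> str:
    """Validate export_format and return the file extension to use downstream."""
    try:
        return SUPPORTED_EXPORT_FORMATS[export_format]
    except KeyError:
        # Fail fast with a clear message instead of silently ignoring the value.
        raise ValueError(
            f"Unsupported export_format {export_format!r}; "
            f"expected one of {sorted(SUPPORTED_EXPORT_FORMATS)}"
        ) from None
```

The returned extension would then replace the metadata-derived `file_extension`, so the requested format and the saved filename stay consistent.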
300-305: ⚠️ Potential issue | 🟡 Minor

Return the attachment actually linked to the new document.

Echoing the input `attachment_id` here can be stale if the orchestrator copies or rebinds the attachment during document creation. Return `result.attachment_id` so callers see the attachment that the new knowledge document actually references.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 300 - 305, The returned payload currently echoes the input attachment_id which can be stale; update the return dict in the function that builds the response (the block using variables result, doc_title, attachment_id) to return the attachment actually linked to the created document by using result.attachment_id for the "attachment_id" field instead of the input attachment_id so callers receive the real attachment bound to the new document.
205-217: ⚠️ Potential issue | 🔴 Critical

Keep the SQLAlchemy session and ORM object on the same thread.

`db` is created at Line 136 and `user` is loaded from it before the `asyncio.to_thread()` call, but both are then used inside the worker thread. SQLAlchemy sessions and ORM instances are not thread-safe, so this can break once `create_document_with_content()` touches the session in that background thread.

Safer pattern:

```diff
-        result = await asyncio.to_thread(
-            knowledge_orchestrator.create_document_with_content,
-            db=db,
-            user=user,
-            knowledge_base_id=knowledge_base_id,
-            name=title,
-            source_type="text",
-            content=doc_content,
-            file_extension=file_extension,
-            trigger_indexing=trigger_indexing,
-            trigger_summary=trigger_summary,
-            source_config=source_config,
-        )
+        def _create_document() -> Any:
+            thread_db = SessionLocal()
+            try:
+                thread_user = _get_user_from_token(thread_db, token_info)
+                if not thread_user:
+                    raise ValueError("User not found")
+                return knowledge_orchestrator.create_document_with_content(
+                    db=thread_db,
+                    user=thread_user,
+                    knowledge_base_id=knowledge_base_id,
+                    name=title,
+                    source_type="text",
+                    content=doc_content,
+                    file_extension=file_extension,
+                    trigger_indexing=trigger_indexing,
+                    trigger_summary=trigger_summary,
+                    source_config=source_config,
+                )
+            finally:
+                thread_db.close()
+
+        result = await asyncio.to_thread(_create_document)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 205 - 217, The code passes a SQLAlchemy Session object `db` and an ORM `user` into `asyncio.to_thread()` where `knowledge_orchestrator.create_document_with_content` runs; SQLAlchemy sessions/ORM instances are not thread-safe—either perform all DB/ORM work on the current async thread (call `create_document_with_content` directly without `asyncio.to_thread`) or pass only primitive data into the worker and open a new Session there. Concretely, extract any IDs/primitive fields needed from `user` (e.g., `user.id`) and other DB-derived values before calling `asyncio.to_thread`, or move the call out of `asyncio.to_thread` and `await knowledge_orchestrator.create_document_with_content(db=..., user=...)` on the original thread; alternatively, modify `create_document_with_content` to accept primitives and create its own session inside the worker thread.
🧹 Nitpick comments (1)
backend/app/services/dingtalk/docs_service.py (1)
207-211: Trace the new async service entrypoints.

`get_document_info()` and `download_document_content()` are the backend entrypoints for this feature, but neither emits a span today, so DingTalk fetch/download failures will be missing from the trace tree.

As per coding guidelines, "Use `@trace_async(span_name, tracer_name, extract_attributes)` decorator to trace entire async functions in OpenTelemetry".

Also applies to: 272-277
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/dingtalk/docs_service.py` around lines 207 - 211, Add OpenTelemetry tracing to the async entrypoints by importing and applying the trace_async decorator: add `@trace_async`("dingtalk.get_document_info", "dingtalk", extract_attributes=...) above the async def get_document_info(...) and add `@trace_async`("dingtalk.download_document_content", "dingtalk", extract_attributes=...) above async def download_document_content(...); ensure trace_async is imported from the tracing/observability helper used elsewhere in the repo and have the extract_attributes include relevant params (e.g., doc_url and user_preferences for get_document_info, doc_url for download_document_content) so failures appear in the trace tree.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/init_data/skills/dingtalk-connector/provider.py`:
- Around line 317-320: The code constructs file_path from a user-supplied title
via _build_filename(title, file_extension) which allows path traversal; sanitize
the title first (e.g., strip/replace path separators and disallow .., or use a
safe helper like os.path.basename or a slug/safe-filename function) before
calling _build_filename, then build the path with os.path.join(base_dir,
filename) and validate with os.path.realpath or pathlib.Path.resolve() that the
resulting path is inside the intended base directory; if validation fails,
reject the title or raise an error.
---
Duplicate comments:
In `@backend/app/mcp_server/tools/dingtalk_docs.py`:
- Around line 300-305: The returned payload currently echoes the input
attachment_id which can be stale; update the return dict in the function that
builds the response (the block using variables result, doc_title, attachment_id)
to return the attachment actually linked to the created document by using
result.attachment_id for the "attachment_id" field instead of the input
attachment_id so callers receive the real attachment bound to the new document.
- Around line 205-217: The code passes a SQLAlchemy Session object `db` and an
ORM `user` into `asyncio.to_thread()` where
`knowledge_orchestrator.create_document_with_content` runs; SQLAlchemy
sessions/ORM instances are not thread-safe—either perform all DB/ORM work on the
current async thread (call `create_document_with_content` directly without
`asyncio.to_thread`) or pass only primitive data into the worker and open a new
Session there. Concretely, extract any IDs/primitive fields needed from `user`
(e.g., `user.id`) and other DB-derived values before calling
`asyncio.to_thread`, or move the call out of `asyncio.to_thread` and `await
knowledge_orchestrator.create_document_with_content(db=..., user=...)` on the
original thread; alternatively, modify `create_document_with_content` to accept
primitives and create its own session inside the worker thread.
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 158-172: The except blocks catching httpx.HTTPStatusError (e) in
docs_service.py currently re-raise new ValueError instances and discard the
original exception context; update each re-raise in the HTTPStatusError handlers
(the block shown and the similar block around lines 414-427) to chain the
original exception by using "raise ValueError(... ) from e" so the original
HTTPStatusError traceback is preserved; ensure you apply this to all branches
(AUTH_ERROR_CODES branch, 406 branch, and the generic failure branch) that
construct and raise ValueError.
- Around line 272-277: download_document_content currently accepts an
export_format but never validates or enforces it; update it (and the comparable
methods in the 339-412 range) to validate export_format against an allowed set
(e.g., {"markdown","html","txt","pdf"} or the formats supported by the MCP),
raise a clear exception for unsupported values, and ensure the validated format
is propagated to downstream requests and used to derive file_extension (instead
of relying on DingTalk metadata); reference the download_document_content
function name and the downstream call(s) that send export_format/file_extension
so you can locate where to inject validation, mapping, and the error raise.
In `@backend/init_data/skills/dingtalk-connector/provider.py`:
- Around line 598-614: _download_document_content_real() fails to extract IDs
for DingTalk wiki links even though _get_document_info_from_mcp() accepts them;
update the ID-extraction logic in _download_document_content_real() to include
the wiki URL pattern used elsewhere (e.g., add a regex covering "/wiki/" URLs
similar to the patterns for "/i/nodes/" and "/i/team/.../docs/") so that doc_id
is correctly set for wiki links and the function no longer returns the "Could
not extract document ID" error.
- Around line 631-644: The payload is calling an unregistered MCP tool
"download_document"; replace it with a registered tool (e.g.,
"add_dingtalk_doc_with_attachment" or "get_dingtalk_document_info") or register
a new tool. Update the "name" field in the payload from "download_document" to
one of the existing exported tools (suggest "add_dingtalk_doc_with_attachment"
if you intend to upload the downloaded content into knowledge) and ensure the
arguments match that tool's signature (use doc_id, url and any expected keys),
or alternatively implement/register a new tool named "download_document" in
backend/app/mcp_server/tools/dingtalk_docs.py if that behavior is required.
---
Nitpick comments:
In `@backend/app/services/dingtalk/docs_service.py`:
- Around line 207-211: Add OpenTelemetry tracing to the async entrypoints by
importing and applying the trace_async decorator: add
`@trace_async`("dingtalk.get_document_info", "dingtalk", extract_attributes=...)
above the async def get_document_info(...) and add
`@trace_async`("dingtalk.download_document_content", "dingtalk",
extract_attributes=...) above async def download_document_content(...); ensure
trace_async is imported from the tracing/observability helper used elsewhere in
the repo and have the extract_attributes include relevant params (e.g., doc_url
and user_preferences for get_document_info, doc_url for
download_document_content) so failures appear in the trace tree.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 587d57d3-0dbb-45ad-b12c-c8dae808a360
📒 Files selected for processing (5)
backend/app/mcp_server/tools/dingtalk_docs.py
backend/app/services/dingtalk/__init__.py
backend/app/services/dingtalk/docs_service.py
backend/init_data/skills/dingtalk-connector/SKILL.md
backend/init_data/skills/dingtalk-connector/provider.py
✅ Files skipped from review due to trivial changes (2)
- backend/app/services/dingtalk/__init__.py
- backend/init_data/skills/dingtalk-connector/SKILL.md
```python
from app.services.knowledge.orchestrator import _build_filename

filename = _build_filename(title, file_extension)
file_path = f"/home/user/{filename}"
```
Sanitize the title before turning it into a sandbox path.
_build_filename() in backend/app/services/knowledge/orchestrator.py:84-91 only appends the extension, so a title like ../../tmp/payload produces /home/user/../../tmp/payload.md here. That lets a user-supplied or remote DingTalk title escape the intended workspace path.
Suggested hardening
```diff
+import re
@@
-    filename = _build_filename(title, file_extension)
+    safe_title = re.sub(r"[\\/]+", "_", title).strip().strip(".")
+    if not safe_title:
+        error_msg = "Document title produced an invalid filename"
+        logger.error(f"[DingTalkDocToKBTool] {error_msg}")
+        result = self._format_error(error_message=error_msg)
+        await self._emit_tool_status("failed", error_msg)
+        return result
+
+    filename = _build_filename(safe_title, file_extension)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 317 -
320, The code constructs file_path from a user-supplied title via
_build_filename(title, file_extension) which allows path traversal; sanitize the
title first (e.g., strip/replace path separators and disallow .., or use a safe
helper like os.path.basename or a slug/safe-filename function) before calling
_build_filename, then build the path with os.path.join(base_dir, filename) and
validate with os.path.realpath or pathlib.Path.resolve() that the resulting path
is inside the intended base directory; if validation fails, reject the title or
raise an error.
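The hardening described above can be condensed into a small helper. The base directory, regex, and helper name below mirror the suggestion but are illustrative, not the skill's actual code:

```python
import re
from pathlib import Path

BASE_DIR = Path("/home/user")


def safe_sandbox_path(title: str, file_extension: str) -> Path:
    """Build a sandbox file path from an untrusted document title."""
    # Collapse path separators so "../../tmp/payload" cannot traverse upward.
    safe_title = re.sub(r"[\\/]+", "_", title).strip().strip(".")
    if not safe_title:
        raise ValueError("Document title produced an invalid filename")
    candidate = (BASE_DIR / f"{safe_title}.{file_extension}").resolve()
    # Belt and braces: verify the resolved path stays inside BASE_DIR.
    if BASE_DIR.resolve() not in candidate.parents:
        raise ValueError(f"Path escapes workspace: {candidate}")
    return candidate


print(safe_sandbox_path("../../tmp/payload", "md"))
```

A traversal-style title is flattened into a single path component, and a title made entirely of dots or separators is rejected outright.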
- Add trace_async/trace_sync decorators to MCP handlers in dingtalk_docs.py
- Add trace_async decorator to dingtalk-connector provider _arun method
- Fix URL extraction in _download_document_content_real to support wiki links
- Add new download_dingtalk_document MCP tool for skill content download
- Fix filename usage in add_dingtalk_doc_to_knowledge to use built filename
- Fix MCP payload tool name from non-existent download_document to download_dingtalk_document
@coderabbitai resume
Actionable comments posted: 1
🧹 Nitpick comments (3)
backend/init_data/skills/dingtalk-connector/provider.py (3)
62-79: Use `raise ... from err` for better exception chaining.

When re-raising an exception within an `except` block, use `raise ... from err` (or `raise ... from None`) to preserve the exception chain and distinguish the new exception from errors in exception handling.

♻️ Suggested fix

```diff
     except ImportError:
         # Try absolute import (for dynamic loading)
         import sys

         # Get the package name dynamically
         package_name = __name__.rsplit(".", 1)[0]
         _base_module = sys.modules.get(f"{package_name}._base")
         if _base_module:
             BaseSandboxTool = _base_module.BaseSandboxTool
         else:
-            raise ImportError(
+            raise ImportError(
                 "Cannot import BaseSandboxTool from chat_shell.tools.sandbox._base"
-            )
+            ) from None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 62 - 79, The ImportError re-raise in the fallback import path should chain the original exception; in the except ImportError block that constructs package_name and looks up _base_module (and ultimately raises ImportError("Cannot import BaseSandboxTool from chat_shell.tools.sandbox._base")), change the re-raise to preserve the original error by using "raise ImportError(... ) from err" (or from the caught exception variable) so the original ImportError is chained to the new one when BaseSandboxTool cannot be resolved; locate the try/except around the import of BaseSandboxTool and update the final raise to use exception chaining.
554-598: Remove or deprecate dead placeholder code.

`_download_document_content()` is never called — the workflow uses `_download_document_content_real()` instead. Keeping this placeholder creates confusion about which method is the actual implementation.

♻️ Suggested action

Either remove `_download_document_content` entirely, or rename `_download_document_content_real` to `_download_document_content` and delete the placeholder.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 554 - 598, The placeholder method _download_document_content is dead code and conflicts with the real implementation _download_document_content_real; either remove _download_document_content entirely, or move/rename the real implementation by renaming _download_document_content_real to _download_document_content and delete the placeholder, then update any internal references to use the canonical name so only the real implementation remains; ensure imports/uses refer to the chosen name and run tests to confirm no references remain to the removed function.
540-550: Add logging to silent exception handlers.

Multiple `try-except-pass` blocks silently swallow exceptions during temp file cleanup (lines 548-549, 728-729, 897-898, 908-909). While cleanup failures may be non-critical, logging at DEBUG level helps diagnose sandbox issues.

♻️ Example fix for one instance

```diff
         finally:
             # Clean up temp file
             try:
                 await sandbox.commands.run(
                     cmd=f"rm -f {payload_file}",
                     cwd="/home/user",
                     timeout=10,
                 )
-            except Exception:
-                pass
+            except Exception as cleanup_err:
+                logger.debug(f"[DingTalkDocToKBTool] Temp file cleanup failed: {cleanup_err}")
```

Also applies to: 720-730, 890-910

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 540 - 550, The cleanup code currently swallows exceptions in multiple finally/except-pass blocks around sandbox.commands.run (e.g., the rm -f {payload_file} call and similar temp-file cleanup calls), so add logging at DEBUG level in those except handlers: create/get a module logger (e.g., logger = logging.getLogger(__name__)) and replace the bare "except Exception: pass" with a handler that logs a clear message including the resource name (payload_file or other temp filename) and the caught exception (use exc_info=True or include traceback). Apply the same change to all similar blocks that silently pass (the cleanup calls around sandbox.commands.run and temp file removals referenced in the review).
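A minimal standalone sketch of the pattern the comment asks for — best-effort cleanup that records failures at DEBUG instead of swallowing them. The function name and the injected `remove` callable are hypothetical; the provider's real cleanup runs `rm -f` inside the sandbox:

```python
import logging

logger = logging.getLogger("dingtalk.cleanup")


def cleanup_temp_file(remove, path: str) -> bool:
    """Attempt cleanup; log any failure at DEBUG level rather than swallowing it."""
    try:
        remove(path)
        return True
    except Exception as cleanup_err:
        # Name the resource and the error so sandbox issues stay diagnosable.
        logger.debug("Temp file cleanup failed for %s: %s", path, cleanup_err)
        return False
```

The caller's control flow is unchanged — cleanup failure never raises — but the diagnostic trail is no longer empty.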
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:

In `@backend/init_data/skills/dingtalk-connector/provider.py`:
- Around line 704-711: The current return only propagates "content" from the MCP tool result, causing callers to default the file extension to "md"; update the handler (the block that processes the result of download_dingtalk_document / the tool_result handling in provider.py) to also extract file_extension from tool_result (e.g., tool_result.get("file_extension")) and include it in the returned dict (e.g., return {"success": True, "content": content, "file_extension": file_extension}), and when missing, supply a sensible default or None so the caller no longer always falls back to ".md".
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fbc39a5f-86e1-4f12-929e-d4badf7a88e0
📒 Files selected for processing (2)
- backend/app/mcp_server/tools/dingtalk_docs.py
- backend/init_data/skills/dingtalk-connector/provider.py
✅ Actions performed

Reviews resumed.
🧹 Nitpick comments (4)
backend/init_data/skills/dingtalk-connector/provider.py (3)
62-79: Use `raise ... from None` for cleaner exception chaining.

When raising a new `ImportError` after failing to find the module dynamically, use `raise ... from None` to suppress the confusing chained exception context.

♻️ Suggested fix

```diff
             if _base_module:
                 BaseSandboxTool = _base_module.BaseSandboxTool
             else:
-                raise ImportError(
+                raise ImportError(
                     "Cannot import BaseSandboxTool from chat_shell.tools.sandbox._base"
-                )
+                ) from None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 62 - 79, The ImportError raised when _base_module is not found should suppress chained exception context; update the else branch where you raise ImportError("Cannot import BaseSandboxTool from chat_shell.tools.sandbox._base") to use "raise ImportError(... ) from None" so the new ImportError is raised without showing the previous exception context—make this change in the block that resolves BaseSandboxTool (variables: package_name, _base_module, BaseSandboxTool).
554-598: Remove unused placeholder method.

`_download_document_content` returns hardcoded placeholder content and is not called anywhere — the actual implementation is `_download_document_content_real` (used at line 309). Remove this dead code to avoid confusion.

🗑️ Remove dead code

```diff
-    async def _download_document_content(self, sandbox: Any, doc_url: str) -> dict:
-        """Download document content from DingTalk.
-
-        For now, this returns placeholder content.
-        In production, this would call the DingTalk API via MCP.
-        """
-        try:
-            # Extract doc ID for content generation
-            import re
-
-            doc_id = None
-            patterns = [
-                r"alidocs\.dingtalk\.com/i/nodes/([a-zA-Z0-9_-]+)",
-                r"alidocs\.dingtalk\.com/i/team/[^/]+/docs/([a-zA-Z0-9_-]+)",
-            ]
-
-            for pattern in patterns:
-                match = re.search(pattern, doc_url)
-                if match:
-                    doc_id = match.group(1)
-                    break
-
-            # Generate placeholder content
-            content = f"""# DingTalk Document
-
-This document was imported from DingTalk.
-
-**Source URL:** {doc_url}
-**Document ID:** {doc_id or "unknown"}
-**Imported at:** {datetime.now().isoformat()}
-
-## Content
-
-The actual content would be fetched from DingTalk API in production.
-This is a placeholder for the document content.
-
----
-*Imported by Wegent DingTalk Docs Skill*
-"""
-
-            return {"success": True, "content": content}
-
-        except Exception as e:
-            return {"success": False, "error": str(e)}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 554 - 598, Remove the unused placeholder function _download_document_content from the DingTalk provider (it returns hardcoded content and is not referenced); locate and delete the async def _download_document_content(self, sandbox: Any, doc_url: str) block, ensuring only the real implementation _download_document_content_real remains and that no other code references the removed symbol (search for _download_document_content to confirm).
418-553: Extract duplicated URL parsing logic into a helper.

The URL pattern matching to extract `doc_id` is duplicated in `_get_document_info_from_mcp` (lines 428-438), `_download_document_content` (lines 565-574), and `_download_document_content_real` (lines 609-619). Extract this into a private helper method to reduce duplication and ensure consistent pattern support.

♻️ Example helper extraction

```python
def _extract_doc_id_from_url(self, doc_url: str) -> Optional[str]:
    """Extract document ID from DingTalk URL.

    Supports nodes, docs, and wiki URL formats.
    """
    import re

    patterns = [
        r"alidocs\.dingtalk\.com/i/nodes/([a-zA-Z0-9_-]+)",
        r"alidocs\.dingtalk\.com/i/team/[^/]+/docs/([a-zA-Z0-9_-]+)",
        r"alidocs\.dingtalk\.com/i/team/[^/]+/wiki/([a-zA-Z0-9_-]+)",
    ]
    for pattern in patterns:
        match = re.search(pattern, doc_url)
        if match:
            return match.group(1)
    return None
```

Then replace the duplicated pattern matching in each method with:

```python
doc_id = self._extract_doc_id_from_url(doc_url)
if not doc_id:
    return {
        "success": False,
        "error": f"Could not extract document ID from URL: {doc_url}",
    }
```

Also applies to: 599-733
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/init_data/skills/dingtalk-connector/provider.py` around lines 418 - 553, The URL extraction regex is duplicated across _get_document_info_from_mcp, _download_document_content, and _download_document_content_real; add a private helper method (e.g., _extract_doc_id_from_url(self, doc_url: str) -> Optional[str]) that contains the three regex patterns and returns the matched group or None, then replace each local regex block with a call to this helper and preserve the existing error handling (returning {"success": False, "error": ...} when None); update the three methods to use the helper to ensure consistent behavior and remove the duplicated imports/usages of re and repeated pattern lists.

backend/app/mcp_server/tools/dingtalk_docs.py (1)
200-212: Move imports to module level.

`_build_filename` (line 201) and `asyncio` (line 212) are imported inside the function. Move these to the top of the file for consistency and to avoid repeated import overhead.

Additionally, `_build_filename` is a private function (underscore prefix). Consider whether this should be exposed as a public API or if the filename building logic should be duplicated here to avoid depending on private internals that may change.

♻️ Suggested refactor

```diff
 import logging
+import asyncio
 from typing import Any, Dict, Optional

 from app.db.session import SessionLocal
 from app.mcp_server.auth import TaskTokenInfo
 from app.mcp_server.tools.decorator import build_mcp_tools_dict, mcp_tool
 from app.mcp_server.tools.knowledge import _get_user_from_token
 from app.services.dingtalk.docs_service import dingtalk_docs_service
-from app.services.knowledge.orchestrator import knowledge_orchestrator
+from app.services.knowledge.orchestrator import knowledge_orchestrator, _build_filename
 from shared.telemetry.decorators import trace_async, trace_sync
```

Then remove the inline imports at lines 201 and 212.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 200 - 212, Move the inline imports to module level: add "from app.services.knowledge.orchestrator import _build_filename" and "import asyncio" at the top of the module and remove the in-function imports around the calls to _build_filename and asyncio; also evaluate whether to replace usage of the private helper _build_filename (underscore-prefixed) by promoting it to a public function (rename/export) or by duplicating the filename-building logic locally so the module does not depend on a private internal API.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8ff6ace5-a948-4e1e-891d-a0f8b4dcba91
📒 Files selected for processing (2)
- backend/app/mcp_server/tools/dingtalk_docs.py
- backend/init_data/skills/dingtalk-connector/provider.py
Remove asyncio.to_thread() call that was passing SQLAlchemy Session and ORM User objects across thread boundaries. SQLAlchemy objects are not thread-safe and should not be passed between threads. Instead, call create_document_with_content() directly since it's a synchronous function that executes database operations within the same thread context where the Session was created. This aligns with the pattern used in other MCP tools (knowledge.py) which also call orchestrator methods directly without using to_thread.
Actionable comments posted: 2
♻️ Duplicate comments (1)
backend/app/mcp_server/tools/dingtalk_docs.py (1)
395-400: ⚠️ Potential issue | 🟡 Minor

Return the actual linked attachment id from the created document.

This response echoes the input `attachment_id`, which may differ from the attachment actually linked by the orchestrator.

Suggested fix

```diff
     return {
         "success": True,
         "document_id": result.id,
         "document_name": result.name,
-        "attachment_id": attachment_id,
+        "attachment_id": (
+            result.attachment_id
+            if getattr(result, "attachment_id", None) is not None
+            else attachment_id
+        ),
         "message": f"Document '{doc_title}' added to knowledge base successfully",
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 395 - 400, The return currently echoes the input attachment_id rather than the actual attachment linked by the orchestrator; update the return to read the linked attachment id from the created document object (the result variable) — e.g. use result.attachment_id or result.linked_attachment.id or result.linked_attachment_id (whichever exists on the returned object) and fall back to the original attachment_id if none is present, then return that value in the "attachment_id" field along with document_id/result.id and document_name/result.name.
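The fallback the fix describes can be isolated into a tiny helper. This is a sketch under the assumption that the orchestrator's returned document object exposes an `attachment_id` attribute when a link was made; the helper name is illustrative:

```python
def resolve_attachment_id(result, requested_id):
    """Prefer the attachment id actually linked on the created document;
    fall back to the caller-supplied id when the field is absent or None."""
    linked = getattr(result, "attachment_id", None)
    return linked if linked is not None else requested_id
```

Using `getattr` with a default keeps the response correct even for document objects that predate the attribute.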
🧹 Nitpick comments (1)
backend/app/mcp_server/tools/dingtalk_docs.py (1)
167-189: Narrow the inner exception handling around DingTalk fetch.

Catching `Exception` here masks unexpected programming/runtime errors and duplicates outer error handling. Prefer catching expected fetch failures only (e.g., `ValueError`) and let unknown exceptions bubble to the outer `except` with a stack trace.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 167 - 189, The current broad except in the block calling dingtalk_docs_service.download_document_content masks unexpected errors; narrow the handler to only expected fetch failures (e.g., ValueError or a specific service fetch exception) around the call to dingtalk_docs_service.download_document_content and related parsing (doc_download, doc_content, file_extension, modified_time, update_time), log and return the graceful error only for those expected exceptions, and let other exceptions propagate (or re-raise) so the outer exception handler can capture stack traces.
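A compressed sketch of the narrow-except shape the comment recommends. `safe_fetch` and the injected `fetch` callable are stand-ins for the tool handler and `dingtalk_docs_service.download_document_content`:

```python
def safe_fetch(fetch):
    """Handle only expected fetch failures (ValueError here); anything else
    propagates so an outer handler can log the full stack trace."""
    try:
        return {"success": True, "content": fetch()}
    except ValueError as err:
        # Expected failure mode: bad URL, missing document, etc.
        return {"success": False, "error": str(err)}
```

A `TypeError` or `AttributeError` from a programming bug now escapes instead of being silently converted into a graceful-looking error response.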
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 68acdb67-cbb6-462a-9a2b-291a89899ebf
📒 Files selected for processing (2)
- backend/app/mcp_server/tools/dingtalk_docs.py
- backend/app/services/knowledge/orchestrator.py
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/app/services/knowledge/orchestrator.py
```python
extract_attributes=lambda token_info, doc_url: {
    "doc_url": doc_url,
},
```
Avoid storing full `doc_url` in traces/logs.

Raw DingTalk URLs can carry sensitive query parameters. Redact query strings (or emit only `doc_id`) before logging/telemetry attributes.
Suggested fix

```diff
+from urllib.parse import urlsplit, urlunsplit
+
+def _safe_url_for_observability(url: str) -> str:
+    parts = urlsplit(url)
+    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

 @trace_async(
@@
-    extract_attributes=lambda token_info, doc_url: {
-        "doc_url": doc_url,
+    extract_attributes=lambda token_info, doc_url: {
+        "doc_url": _safe_url_for_observability(doc_url),
     },
 )
@@
-    logger.info(
-        f"[MCP] Content not provided, fetching from DingTalk: {doc_url}"
-    )
+    logger.info(
+        "[MCP] Content not provided, fetching from DingTalk: %s",
+        _safe_url_for_observability(doc_url),
+    )
```

Also applies to: 111-114, 164-166, 264-266
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 43 - 45, The
current extract_attributes lambda stores the full doc_url (which may include
sensitive query params) in traces/logs; change it to redact query strings or
emit only a parsed doc_id instead: parse the provided doc_url inside
extract_attributes (and the other extract_attributes occurrences) to remove the
query component (or extract the document identifier) and set the attribute to
either "doc_id": <extracted_id> or "doc_url": <url_without_query> before
returning, so no raw query parameters are logged.
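The redaction helper in the suggested fix is small enough to verify standalone. This sketch keeps scheme, host, and path while dropping the query string and fragment, which is where DingTalk URLs tend to carry tokens:

```python
from urllib.parse import urlsplit, urlunsplit


def safe_url_for_observability(url: str) -> str:
    """Strip query string and fragment before a URL reaches logs or traces."""
    parts = urlsplit(url)
    # Empty query/fragment fields mean urlunsplit emits no "?" or "#".
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(safe_url_for_observability(
    "https://alidocs.dingtalk.com/i/nodes/abc123?corpId=secret&token=t#frag"
))
```

URLs without a query pass through unchanged, so the helper is safe to apply unconditionally at every log/trace site.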
```python
if not modified_time:
    modified_time = doc_download.get("modified_time_formatted")
# Get file_extension from DingTalk response
```
Use the actual DingTalk download key for modified time.

`download_document_content()` returns `modified_time`, not `modified_time_formatted`. This branch leaves `modified_time` unset unexpectedly.
Suggested fix

```diff
-    if not modified_time:
-        modified_time = doc_download.get("modified_time_formatted")
+    if not modified_time:
+        modified_time = doc_download.get("modified_time")
```
+ modified_time = doc_download.get("modified_time")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/mcp_server/tools/dingtalk_docs.py` around lines 178 - 180, The
code branch in download_document_content (or the caller handling doc_download)
incorrectly reads modified_time_formatted, leaving modified_time unset; change
the assignment to use the actual DingTalk key "modified_time" (e.g., set
modified_time = doc_download.get("modified_time")) so modified_time is populated
when download_document_content() returns modified_time rather than
modified_time_formatted.
close
Summary
Add support for importing DingTalk documents into Wegent knowledge bases.
Features
MCP Tools
- `get_dingtalk_document_info`: Get document metadata from URL
- `add_dingtalk_doc_to_knowledge`: Add document with text content
- `add_dingtalk_doc_with_attachment`: Add document using attachment

DingTalk Docs Service
- Document URL parsing and ID extraction
- Document metadata retrieval
- Filename generation with naming convention: `{title}_{timestamp}.md`

DingTalk Docs Skill
`dingtalk_doc_to_kb` tool for complete workflow:

1. Start sandbox environment
2. Fetch document info from DingTalk
3. Download document content
4. Save as `{title}_{timestamp}.md`
5. Upload as attachment
6. Create knowledge base document

Naming Convention
Documents are saved with the format:
`{title}_{timestamp}.md` — for example, `产品需求文档_20260413170933.md`. The timestamp follows the `YYYYMMDDHHMMSS` convention.
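The naming convention above can be rendered with a one-line `strftime`. This is a sketch; the backend's own `_build_filename` helper may differ in signature:

```python
from datetime import datetime
from typing import Optional


def build_kb_filename(title: str, now: Optional[datetime] = None) -> str:
    """Render {title}_{timestamp}.md with a YYYYMMDDHHMMSS timestamp."""
    # %Y%m%d%H%M%S yields the zero-padded 14-digit convention, e.g. 20260413170933.
    timestamp = (now or datetime.now()).strftime("%Y%m%d%H%M%S")
    return f"{title}_{timestamp}.md"


print(build_kb_filename("产品需求文档", datetime(2026, 4, 13, 17, 9, 33)))
```

Passing `now` explicitly keeps the function deterministic for tests; production callers omit it and get the current time.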
- app/services/dingtalk/__init__.py
- app/services/dingtalk/docs_service.py
- app/mcp_server/tools/dingtalk_docs.py
- init_data/skills/dingtalk-docs/SKILL.md
- init_data/skills/dingtalk-docs/__init__.py
- init_data/skills/dingtalk-docs/provider.py

Files Modified
- app/mcp_server/server.py (import dingtalk_docs module)

Summary by CodeRabbit